Title: | Resampling Algorithms for Multi-Label Datasets |
Version: | 0.2.3 |
Description: | Collection of state-of-the-art multi-label resampling algorithms. The objective of these algorithms is to achieve balance in multi-label datasets. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
Imports: | data.table, e1071, mldr, pbapply, vecsets |
Suggests: | parallel |
NeedsCompilation: | no |
Packaged: | 2023-08-22 09:28:44 UTC; mdavila |
Author: | Miguel Ángel Dávila [cre],
Francisco Charte |
Maintainer: | Miguel Ángel Dávila <madr0008@red.ujaen.es> |
Repository: | CRAN |
Date/Publication: | 2023-08-22 12:20:02 UTC |
Randomly clones instances with minority labelsets
Description
This function implements the LP-ROS algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies instances with minority labelsets and randomly clones them.
Usage
LPROS(D, P)
Arguments
D |
mld |
P |
Percentage in which the original dataset is increased |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
Examples
library(mldr)
library(mldr.resampling)
LPROS(birds, 25)
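The idea behind LP-ROS can be sketched in base R on a toy label matrix. This is a simplified illustration, not the package's implementation: it assumes that a labelset counts as a minority labelset when it is rarer than the average labelset, and the names `labels`, `labelset` and `clones` are hypothetical.

```r
set.seed(1)

# Toy label matrix: 8 instances, 3 labels (hypothetical data)
labels <- data.frame(
  l1 = c(1, 1, 1, 1, 1, 1, 0, 0),
  l2 = c(0, 0, 0, 0, 0, 0, 1, 1),
  l3 = c(0, 0, 0, 0, 0, 0, 0, 1)
)
P <- 25  # percentage by which the dataset grows

# A labelset is the combination of labels active in one instance
labelset <- apply(labels, 1, paste, collapse = "")
counts <- table(labelset)

# Assumption: labelsets rarer than the average one are "minority"
minority_sets <- names(counts)[as.vector(counts) < mean(counts)]
candidates <- which(labelset %in% minority_sets)

# Clone randomly chosen minority instances until the target size is met
n_new <- round(nrow(labels) * P / 100)
clones <- candidates[sample.int(length(candidates), n_new, replace = TRUE)]
oversampled <- labels[c(seq_len(nrow(labels)), clones), ]
```

Here only the last two instances carry minority labelsets, so every clone is a copy of one of them.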
Randomly deletes instances with majority labelsets
Description
This function implements the LP-RUS algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies instances with majority labelsets and randomly deletes them from the original dataset.
Usage
LPRUS(D, P)
Arguments
D |
mld |
P |
Percentage in which the original dataset is decreased |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
Examples
library(mldr)
library(mldr.resampling)
LPRUS(birds, 25)
Randomly clones instances with minority labels
Description
This function implements the ML-ROS algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies instances with minority labels and randomly clones them.
Usage
MLROS(D, P)
Arguments
D |
mld |
P |
Percentage in which the original dataset is increased |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
Examples
library(mldr)
library(mldr.resampling)
MLROS(birds, 25)
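Which labels count as minority labels can be illustrated with the imbalance measures from the paper cited under Source. The sketch below is base R on a hypothetical label matrix `Y` and is not taken from the package's code: it assumes a label is a minority label when its imbalance ratio (IRLbl) exceeds the mean imbalance ratio (MeanIR).

```r
# Toy label matrix: 8 instances, labels a, b, c (hypothetical data)
Y <- matrix(
  c(rep(1, 8),                # a is active in all 8 instances
    rep(c(1, 0), each = 4),   # b is active in 4 instances
    c(1, rep(0, 7))),         # c is active in 1 instance
  ncol = 3,
  dimnames = list(NULL, c("a", "b", "c"))
)

counts  <- colSums(Y)            # how often each label is active
irlbl   <- max(counts) / counts  # imbalance ratio per label
mean_ir <- mean(irlbl)           # mean imbalance ratio of the dataset

# Assumption: labels whose ratio exceeds the mean are minority labels
minority <- names(irlbl)[irlbl > mean_ir]
```

With these counts the ratios are 1, 2 and 8, so only label `c` is flagged as a minority label.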
Randomly deletes instances with majority labels
Description
This function implements the ML-RUS algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies instances with majority labels and randomly deletes them from the original dataset.
Usage
MLRUS(D, P)
Arguments
D |
mld |
P |
Percentage in which the original dataset is decreased |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163, 3-16.
Examples
library(mldr)
library(mldr.resampling)
MLRUS(birds, 25)
Reverse-nearest neighborhood based oversampling for imbalanced, multi-label datasets
Description
This function implements an algorithm that uses the concept of reverse nearest neighbors, in order to create new instances for each label. Then, several radial SVMs, one for each label, are trained in order to predict each label of the synthetic instances.
Usage
MLRkNNOS(D, k, tableVDM = NULL)
Arguments
D |
mld |
k |
Number of neighbors to be considered when creating a synthetic instance |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Sadhukhan, P., & Palit, S. (2019). Reverse-nearest neighborhood based oversampling for imbalanced, multi-label datasets. Pattern Recognition Letters, 125, 813-820
Synthetic oversampling of multilabel instances (MLSMOTE)
Description
This function implements the MLSMOTE algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies instances with minority labels and generates synthetic instances based on their neighbor instances.
Usage
MLSMOTE(D, k, strategy = "ranking", tableVDM = NULL)
Arguments
D |
mld |
k |
Number of neighbors to be considered when creating a synthetic instance |
strategy |
Strategy for choosing the synthetic labels. Possible values: "union", "intersection" and "ranking" (default) |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2015). MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems, 89, 385-397.
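The three values of `strategy` can be illustrated with a small base R sketch. It is an assumption-laden illustration rather than the package's code: the helper `synth_labels` is hypothetical, and "ranking" is assumed to activate a label when it is present in at least half of the instances considered (the seed plus its neighbors).

```r
# seed:  label vector of the seed instance
# neigh: matrix with one row of labels per neighbor
synth_labels <- function(seed, neigh,
                         strategy = c("ranking", "union", "intersection")) {
  strategy <- match.arg(strategy)
  all_rows <- rbind(seed, neigh)
  active <- colSums(all_rows)  # how many instances have each label active
  switch(strategy,
    union        = as.integer(active > 0),                # active anywhere
    intersection = as.integer(active == nrow(all_rows)),  # active everywhere
    ranking      = as.integer(active >= nrow(all_rows) / 2)  # assumed rule
  )
}

seed  <- c(1, 0, 0)
neigh <- rbind(c(1, 1, 0), c(1, 0, 0))

labels_union        <- synth_labels(seed, neigh, "union")
labels_intersection <- synth_labels(seed, neigh, "intersection")
labels_ranking      <- synth_labels(seed, neigh)
```

In this toy case the second label is active in one of three instances, so it survives only under "union".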
Multi-label oversampling based on local label imbalance (MLSOL)
Description
This function implements the MLSOL algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, which applies oversampling on difficult regions of the instance space, in order to help classifiers distinguish labels.
Usage
MLSOL(D, P, k, neighbors = NULL, tableVDM = NULL)
Arguments
D |
mld |
P |
Percentage in which the original dataset is increased |
k |
Number of neighbors to be considered when computing the neighbors of an instance |
neighbors |
Structure with all instances and neighbors in the dataset. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Liu, B., Blekas, K., & Tsoumakas, G. (2022). Multi-label sampling based on local label imbalance. Pattern Recognition, 122, 108294.
Multilabel approach for the Tomek Link undersampling algorithm (MLTL)
Description
This function implements the MLTL algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies Tomek links (majority instances with a very different neighbor) and removes them. It behaves like MLeNN with the number of neighbors set to 1.
Usage
MLTL(D, TH, neighbors = NULL, tableVDM = NULL)
Arguments
D |
mld |
TH |
Threshold for the Hamming distance above which an instance is considered different from another one. |
neighbors |
Structure with instances and neighbors. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Pereira, R. M., Costa, Y. M., & Silla Jr, C. N. (2020). MLTL: A multi-label approach for the Tomek Link undersampling algorithm. Neurocomputing, 383, 95-105.
Multi-label undersampling based on local label imbalance (MLUL)
Description
This function implements the MLUL algorithm. It is a preprocessing algorithm for imbalanced multilabel datasets, which applies undersampling, removing difficult instances according to their neighbors.
Usage
MLUL(D, P, k, neighbors = NULL, tableVDM = NULL)
Arguments
D |
mld |
P |
Percentage in which the original dataset is decreased |
k |
Number of neighbors to be considered when computing the neighbors of an instance |
neighbors |
Structure with all instances and neighbors in the dataset. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Liu, B., Blekas, K., & Tsoumakas, G. (2022). Multi-label sampling based on local label imbalance. Pattern Recognition, 122, 108294.
Multilabel edited Nearest Neighbor (MLeNN)
Description
This function implements the MLeNN algorithm, a preprocessing algorithm for imbalanced multilabel datasets that identifies instances with majority labels and removes their neighbors that are too different from them in terms of active labels.
Usage
MLeNN(D, TH = 0.5, k = 3, neighbors = NULL, tableVDM = NULL)
Arguments
D |
mld |
TH |
Threshold for the Hamming distance above which an instance is considered different from another one. Defaults to 0.5. |
k |
Number of nearest neighbors to check for each instance. Defaults to 3. |
neighbors |
Structure with instances and neighbors. If it is empty, it will be calculated by the function |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
An mldr object containing the preprocessed multilabel dataset
Source
Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. MLeNN: A First Approach to Heuristic Multilabel Undersampling. Intelligent Data Engineering and Automated Learning – IDEAL 2014. ISBN 978-3-319-10840-7.
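How `TH` and `k` interact can be sketched in base R. The code below only flags which neighbors of an instance are "too different" and leaves out the removal step. It is not the package's implementation: the helper names are hypothetical, and the distance is assumed to be the number of differing labels divided by the number of active labels in both instances.

```r
# Assumed form of the adjusted Hamming distance between two label vectors
adjusted_hamming <- function(a, b) sum(a != b) / (sum(a) + sum(b))

# Return the neighbors of instance i whose distance to i exceeds TH
too_different <- function(i, nbrs, Y, TH = 0.5) {
  d <- vapply(nbrs, function(j) adjusted_hamming(Y[i, ], Y[j, ]), numeric(1))
  nbrs[d > TH]
}

# Toy label matrix (hypothetical data): 4 instances, 3 labels
Y <- rbind(
  c(1, 0, 0),
  c(1, 0, 0),
  c(0, 1, 1),
  c(0, 1, 0)
)

flagged_1 <- too_different(1, c(2, 3, 4), Y)  # k = 3 neighbors of instance 1
flagged_2 <- too_different(2, c(1), Y)        # a single, identical neighbor
```

Instance 1 shares its labelset with instance 2 but has nothing in common with instances 3 and 4, so only those two are flagged.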
Decouples highly imbalanced labels
Description
This function implements the REMEDIAL algorithm, a preprocessing algorithm for imbalanced multilabel datasets that decouples frequent and rare classes appearing in the same instance. To do so, it adds new instances to the dataset and edits the labels present in them.
Usage
REMEDIAL(mld)
Arguments
mld |
|
Value
An mldr object containing the preprocessed multilabel dataset
Source
F. Charte, A. J. Rivera, M. J. del Jesus, F. Herrera. "Resampling Multilabel Datasets by Decoupling Highly Imbalanced Labels". Proc. 2015 International Conference on Hybrid Artificial Intelligent Systems (HAIS 2015), pp. 489-501, Bilbao, Spain, 2015. Implementation from the original mldr package.
Examples
library(mldr)
library(mldr.resampling)
REMEDIAL(birds)
Auxiliary function used by MLeNN. Computes the Hamming Distance between two instances
Description
Auxiliary function used by MLeNN. Computes the Hamming Distance between two instances
Usage
adjustedHammingDist(x, y, D)
Arguments
x |
Index of sample 1 |
y |
Index of sample 2 |
D |
mld |
Value
The Hamming Distance between the instances
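As a rough illustration of the quantity involved, the Hamming distance between two label vectors is the number of positions in which they differ. The sketch below works on plain vectors rather than on instance indexes and an mldr object, and the "adjusted" normalization shown (dividing by the number of active labels in both instances) is an assumption, not taken from the package's code.

```r
# Plain Hamming distance: number of labels on which a and b disagree
hamming <- function(a, b) sum(a != b)

# Assumed adjustment: normalize by the active labels of both instances
adjusted_hamming <- function(a, b) hamming(a, b) / (sum(a) + sum(b))

a <- c(1, 0, 1, 0)
b <- c(1, 1, 0, 0)

h  <- hamming(a, b)           # the vectors differ in positions 2 and 3
ah <- adjusted_hamming(a, b)  # 2 differences over 4 active labels
```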
Auxiliary function used to calculate the distances between an instance and the ones with a specific active label. Euclidean distance is calculated for numeric attributes, and VDM for non-numeric ones.
Description
Auxiliary function used to calculate the distances between an instance and the ones with a specific active label. Euclidean distance is calculated for numeric attributes, and VDM for non-numeric ones.
Usage
calculateDistances(sample, rest, label, D, tableVDM = NULL)
Arguments
sample |
Index of the sample whose distances to other samples we want to know |
rest |
Indexes of the samples to which we will calculate the distance |
label |
Label that must be active |
D |
mld |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
A list with the distance to the rest of samples
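The numeric part of this computation can be sketched in base R. The example covers only Euclidean distance over numeric attributes, ignores the label filter and the VDM term, and uses a hypothetical helper `dist_to_rest` on a plain matrix instead of an mldr object.

```r
# Toy numeric attributes (hypothetical data): 4 instances, 2 attributes
X <- matrix(c(0, 0,
              3, 4,
              6, 8,
              1, 1), ncol = 2, byrow = TRUE)

# Euclidean distance from one instance to a set of other instances
dist_to_rest <- function(sample, rest, X) {
  # Subtract the sample's attribute values from every row in `rest`
  diffs <- sweep(X[rest, , drop = FALSE], 2, X[sample, ])
  sqrt(rowSums(diffs^2))
}

d <- dist_to_rest(1, c(2, 3, 4), X)
```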
Auxiliary function used to calculate an auxiliary table to make VDM calculation faster
Description
Auxiliary function used to calculate an auxiliary table to make VDM calculation faster
Usage
calculateTableVDM(D)
Arguments
D |
mld |
Value
A dataframe with tables, useful for VDM calculation
Auxiliary function used by resample. It executes an algorithm, given as a string, and stores the resulting MLD in an ARFF file
Description
Auxiliary function used by resample. It executes an algorithm, given as a string, and stores the resulting MLD in an ARFF file
Usage
executeAlgorithm(
D,
a,
P,
k,
TH,
strategy,
outputDirectory,
neighbors,
neighbors2,
tableVDM
)
Arguments
D |
mld |
a |
String with the name of the algorithm to be applied. |
P |
Percentage in which the original dataset is increased/decreased (if required by the algorithm) |
k |
Number of neighbors taken into account for each instance (if required by the algorithm) |
TH |
Threshold for the Hamming Distance in order to consider an instance different to another one (if required by the algorithm) |
strategy |
Strategy for choosing the synthetic labels (if required by the algorithm). Possible values: "union", "intersection" and "ranking" (default) |
outputDirectory |
Route with the directory where the generated ARFF file will be stored |
neighbors |
Structure with all instances and neighbors in the dataset, useful in MLSOL and MLUL |
neighbors2 |
Structure with some instances and neighbors in the dataset, useful in MLeNN and MLTL |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
Time (in seconds) taken to execute the algorithm (NULL if no algorithm was executed)
Auxiliary function used by MLSOL. Creates a synthetic sample based on two other samples, taking into account their types
Description
Auxiliary function used by MLSOL. Creates a synthetic sample based on two other samples, taking into account their types
Usage
generateInstanceMLSOL(seedInstance, refNeigh, t, D)
Arguments
seedInstance |
Index of the sample we are using as "template" |
refNeigh |
Index of the reference neighbor |
t |
Types of the instances |
D |
mld |
Value
A synthetic sample derived from the one passed as a parameter and its neighbors
Auxiliary function used by MLSOL and MLUL. Computes the kNN of every instance in a dataset
Description
Auxiliary function used by MLSOL and MLUL. Computes the kNN of every instance in a dataset
Usage
getAllNeighbors(D, d, tableVDM = NULL)
Arguments
D |
mld |
d |
Vector with the instances of the dataset that have one or more active labels (ideally, all of them) |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
A list of vectors with the indexes of the neighbors for each instance
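The shape of the returned structure can be illustrated with a minimal base R sketch. It uses plain Euclidean distance on numeric data through `stats::dist`, whereas the package combines Euclidean distance and VDM. The helper `knn_all` and its `k` argument are hypothetical.

```r
# Toy data (hypothetical): four points on a line
X <- matrix(c(0, 1, 3, 7), ncol = 1)

# For every instance, return the indexes of its k nearest neighbors
knn_all <- function(X, k) {
  dm <- as.matrix(dist(X))  # pairwise Euclidean distances
  diag(dm) <- Inf           # an instance is not its own neighbor
  lapply(seq_len(nrow(X)), function(i) order(dm[i, ])[seq_len(k)])
}

nn <- knn_all(X, k = 2)
```

Each list element holds neighbor indexes sorted from closest to farthest, so `nn[[3]]` lists instance 2 before instance 1.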
Auxiliary function used by MLeNN and MLTL. Gets the kNN of every instance in a dataset, when compared to some of the rest
Description
Auxiliary function used by MLeNN and MLTL. Gets the kNN of every instance in a dataset, when compared to some of the rest
Usage
getAllNeighbors2(neighbors, d, k)
Arguments
neighbors |
Structure with all the neighbors in the dataset, regardless of which ones to be compared |
d |
Vector with the instances of the dataset which are going to be compared |
k |
Number of neighbors to be retrieved |
Value
A list of vectors with the indexes of the neighbors for each instance
Auxiliary function used by MLUL. For each instance in the dataset, given the neighbors structure, we compute its reverse nearest neighbors
Description
Auxiliary function used by MLUL. For each instance in the dataset, given the neighbors structure, we compute its reverse nearest neighbors
Usage
getAllReverseNeighbors(d, neighbors, k)
Arguments
d |
Vector with the instances of the dataset that have one or more active labels (ideally, all of them) |
neighbors |
Structure with the neighbors of every instance in the dataset |
k |
Number of neighbors to be considered |
Value
A list of vectors with the indexes of the reverse nearest neighbors of every instance in the dataset
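Reverse nearest neighbors invert the neighbor relation: instance j is a reverse neighbor of i when i appears among the nearest neighbors of j. A minimal base R sketch, assuming the neighbors are given as a list of index vectors (the helper name is hypothetical and the code is not the package's):

```r
# neighbors[[j]] holds the indexes of the nearest neighbors of instance j
neighbors <- list(c(2, 3), c(1, 3), c(2, 1), c(3, 2))

reverse_neighbors <- function(neighbors) {
  lapply(seq_along(neighbors), function(i) {
    # Keep every instance j that lists i among its neighbors
    which(vapply(neighbors, function(nb) i %in% nb, logical(1)))
  })
}

rnn <- reverse_neighbors(neighbors)
```

Instance 4 is nobody's neighbor in this toy structure, so its entry is an empty vector.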
Auxiliary function used by MLSOL and MLUL. For each instance in the dataset, we compute, for each label, the proportion of neighbors whose class is opposite to that of the instance itself
Description
Auxiliary function used by MLSOL and MLUL. For each instance in the dataset, we compute, for each label, the proportion of neighbors whose class is opposite to that of the instance itself
Usage
getC(D, d, neighbors, k)
Arguments
D |
mld |
d |
Vector with the instances of the dataset that have one or more active labels (ideally, all of them) |
neighbors |
Structure with the neighbors of every instance in the dataset |
k |
Number of neighbors taken into account for each instance |
Value
A structure with the proportion of neighbors having an opposite class with respect to an instance and label
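The proportion described here can be computed on a toy example as follows. This is a base R sketch on a plain label matrix with a hypothetical helper `get_c_sketch`. The package works on an mldr object and may store the result differently.

```r
# Toy label matrix (hypothetical data): 4 instances, 2 labels
Y <- rbind(c(1, 0),
           c(1, 1),
           c(0, 1),
           c(0, 0))
# neighbors[[i]] holds the indexes of the nearest neighbors of instance i
neighbors <- list(c(2, 3), c(1, 3), c(2, 4), c(3, 1))

get_c_sketch <- function(Y, neighbors) {
  C <- matrix(0, nrow(Y), ncol(Y))
  for (i in seq_len(nrow(Y))) {
    nb  <- Y[neighbors[[i]], , drop = FALSE]                  # neighbor labels
    own <- matrix(Y[i, ], nrow(nb), ncol(Y), byrow = TRUE)    # instance labels
    C[i, ] <- colMeans(nb != own)  # share of neighbors that disagree
  }
  C
}

C <- get_c_sketch(Y, neighbors)
```

For instance 1, one of its two neighbors disagrees on the first label and both disagree on the second, giving the row `0.5, 1`.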
Auxiliary function used to compute the neighbors of an instance
Description
Auxiliary function used to compute the neighbors of an instance
Usage
getNN(sample, rest, label, D, tableVDM = NULL)
Arguments
sample |
Index of the sample whose neighbors we want to know |
rest |
Indexes of the samples among which we will search |
label |
Label that must be active, in order to calculate the distances |
D |
mld |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
A vector with the indexes inside rest of the neighbors
Get the number of cores available for parallel computing
Description
Get the number of cores available for parallel computing
Usage
getNumCores()
Value
The number of cores available for parallel computing
Examples
getNumCores()
Auxiliary function used by MLSOL and MLUL. For non-outlier instances, it aggregates the values of C, taking into account the global class imbalance
Description
Auxiliary function used by MLSOL and MLUL. For non-outlier instances, it aggregates the values of C, taking into account the global class imbalance
Usage
getS(D, d, C, minoritary)
Arguments
D |
mld |
d |
Vector with the instances of the dataset that have one or more active labels (ideally, all of them) |
C |
Structure with the proportion of neighbors having an opposite class with respect to an instance and label |
minoritary |
Vector with the minority class of each label (normally, 1) |
Value
A structure with the proportion of neighbors having an opposite class with respect to an instance and label, normalized by the global class imbalance
Auxiliary function used by MLUL. It computes the influence of each instance with respect to its reverse neighbors
Description
Auxiliary function used by MLUL. It computes the influence of each instance with respect to its reverse neighbors
Usage
getU(D, d, rNeighbors, S)
Arguments
D |
mld |
d |
Vector with the instances of the dataset that have one or more active labels (ideally, all of them) |
rNeighbors |
Structure with the reverse nearest neighbors of each instance of the dataset |
S |
Structure with the proportion of neighbors having an opposite class with respect to an instance and label, normalized by the global class imbalance |
Value
A list of values of influence for each instance with respect to its reverse neighbors
Auxiliary function used by MLUL. It calculates, for each instance, how important it is in the dataset
Description
Auxiliary function used by MLUL. It calculates, for each instance, how important it is in the dataset
Usage
getV(w, u)
Arguments
w |
List of weights for each instance |
u |
List of influences in reverse neighbors for each instance |
Value
A list with the values of importance of each instance in the dataset
Auxiliary function used by MLSOL and MLUL. For non-outlier instances, it aggregates the values of S for each label
Description
Auxiliary function used by MLSOL and MLUL. For non-outlier instances, it aggregates the values of S for each label
Usage
getW(S)
Arguments
S |
Structure with the proportion of neighbors having an opposite class with respect to an instance and label, normalized by the global class imbalance |
Value
A vector of weights to be considered when oversampling for each instance
Auxiliary function used by MLSOL. Categorizes each pair instance-label of the dataset with a type
Description
Auxiliary function used by MLSOL. Categorizes each pair instance-label of the dataset with a type
Usage
initTypes(C, neighbors, k, minoritary, D, d)
Arguments
C |
List of vectors with one value for each pair instance-label |
neighbors |
Structure with the k nearest neighbors of each instance of the dataset |
k |
Number of neighbors to be considered for each instance |
minoritary |
Vector with the minority value of each label (normally, 1) |
D |
mld |
d |
Vector with the instances of the dataset that have one or more active labels (ideally, all of them) |
Value
A structure with the type assigned to each instance-label pair
Auxiliary function used by MLSMOTE. Creates a synthetic sample based on values of attributes and labels of its neighbors
Description
Auxiliary function used by MLSMOTE. Creates a synthetic sample based on values of attributes and labels of its neighbors
Usage
newSample(seedInstance, refNeigh, neighbors, strategy, D)
Arguments
seedInstance |
Sample we are using as "template" |
refNeigh |
Reference neighbor |
neighbors |
Neighbors to take into account |
strategy |
Strategy for choosing the synthetic labels: union, intersection or ranking |
D |
mld |
Value
A synthetic sample derived from the one passed as a parameter and its neighbors
Interface function of the package. It executes one or several algorithms, given as strings, and stores the resulting MLDs in arff files
Description
Interface function of the package. It executes one or several algorithms, given as strings, and stores the resulting MLDs in arff files
Usage
resample(
D,
algorithms,
P = 25,
k = 3,
TH = 0.5,
strategy = "ranking",
params,
outputDirectory = tempdir()
)
Arguments
D |
mld |
algorithms |
String, or string vector, with the name(s) of the algorithm(s) to be applied. |
P |
Percentage in which the original dataset is increased/decreased, if required by the algorithm(s). Defaults to 25 |
k |
Number of neighbors taken into account for each instance, if required by the algorithm(s). Defaults to 3 |
TH |
Threshold for the Hamming Distance in order to consider an instance different to another one, if required by the algorithm(s). Defaults to 0.5 |
strategy |
Strategy for choosing the synthetic labels, if required by the algorithm. Possible values: "union", "intersection" and "ranking". Defaults to "ranking" |
params |
Dataframe with 4 columns: name of the algorithm, P, k and TH, in that order, to execute several algorithms with different values for their parameters |
outputDirectory |
Route with the directory where generated ARFF files will be stored. Defaults to a temporary directory |
Value
Dataframe with the time (in seconds) taken to execute each algorithm
Examples
library(mldr)
library(mldr.resampling)
resample(birds, "LPROS", P=25)
resample(birds, c("LPROS", "LPRUS"), P=30)
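The `params` argument is described as a data frame with four columns in a fixed order. A minimal sketch of how such a data frame might be built is shown below. The column names are hypothetical (only the order is documented), and the final call is left as a comment because its exact shape is an assumption.

```r
# One row per run: algorithm name, P, k and TH, in that order.
# Column names are illustrative; the documentation only fixes the order.
params <- data.frame(
  algorithm = c("LPROS", "LPRUS"),
  P         = c(25, 30),
  k         = c(3, 3),
  TH        = c(0.5, 0.5),
  stringsAsFactors = FALSE
)

# Assumed call shape, not verified against the package:
# resample(birds, params = params)
```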
Set the number of cores available for parallel computing
Description
Set the number of cores available for parallel computing
Usage
setNumCores(n)
Arguments
n |
The new value for the number of cores |
Value
No return value, called in order to change the number of cores
Examples
setNumCores(8)
Enable/Disable parallel computing
Description
Enable/Disable parallel computing
Usage
setParallel(beParallel)
Arguments
beParallel |
A boolean indicating if parallel computing is to be enabled (TRUE) or disabled (FALSE) |
Value
No return value, called in order to enable or disable parallel computing
Examples
setParallel(TRUE)
Auxiliary function used to calculate the Value Difference Metric (VDM) between two instances considering their non-numeric attributes
Description
Auxiliary function used to calculate the Value Difference Metric (VDM) between two instances considering their non-numeric attributes
Usage
vdm(D, sample, y, label, tableVDM = NULL)
Arguments
D |
mld |
sample |
Index of the first sample |
y |
Index of the second sample |
label |
Label that will be considered in calculations |
tableVDM |
Dataframe object containing previous calculations for faster processing. If it is empty, the algorithm will be slower |
Value
A value for the distance
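For intuition, the Value Difference Metric compares two categorical values by how differently the classes are distributed among the instances that take each value. The base R sketch below handles a single attribute and a single binary label. It is a textbook-style illustration with hypothetical names (`colour`, `cond_table`, `vdm_value`), not the package's implementation, and the precomputed table only hints at the role `tableVDM` plays.

```r
# Toy data (hypothetical): one categorical attribute and one binary label
colour <- c("red", "red", "blue", "blue", "blue", "green")
label  <- c(1, 0, 1, 1, 0, 0)

# Precompute P(label = c | colour = v) once, so lookups are cheap later
cond_table <- prop.table(table(colour, label), margin = 1)

# VDM between two attribute values: sum over classes of |P(c|v1) - P(c|v2)|^q
vdm_value <- function(v1, v2, tab, q = 1) {
  sum(abs(tab[v1, ] - tab[v2, ])^q)
}

d_red_blue <- vdm_value("red", "blue", cond_table)
d_red_red  <- vdm_value("red", "red", cond_table)
```

Red is split evenly between the two classes while blue is split one third to two thirds, so their distance is 1/3. Identical values are at distance 0.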