Please check the latest news (change log) and keep this package updated.

- `BERT_remove()`: Remove models from the local cache folder.
- `fill_mask()` and `fill_mask_check()`: These functions are only for technical checks (i.e., checking the raw results of the fill-mask pipeline). Normal users should usually use `FMAT_run()`.
- `pattern.special` argument for `FMAT_run()`: Regular expression patterns (matching model names) for special model cases that are uncased or require a special prefix character in certain situations.
  - `prefix.u2581`: adding the prefix `\u2581` for all mask words
  - `prefix.u0120`: adding the prefix `\u0120` for only non-starting mask words
- `set_cache_folder()`, `BERT_download()`, `BERT_info()`, and `BERT_info_date()`.
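For context on the `prefix.u2581` and `prefix.u0120` options above: `\u2581` ("▁") is the word-boundary marker used by SentencePiece tokenizers, while `\u0120` ("Ġ") marks a space-preceded token in GPT-2-style byte-level BPE vocabularies (e.g., RoBERTa). A minimal R sketch of what prepending these prefixes to a mask word looks like (illustrative only; FMAT applies the prefixes internally):

```r
# Illustrative only: the special prefix characters prepended to mask words.
# "\u2581" marks word starts in SentencePiece vocabularies;
# "\u0120" marks space-preceded tokens in GPT-2-style byte-level BPE vocabularies.
word = "nurse"
paste0("\u2581", word)  # token form for SentencePiece models (all mask words)
paste0("\u0120", word)  # token form for byte-level BPE models (non-starting mask words)
```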
- Model information from `BERT_info()` and model initial commit dates scraped from HuggingFace by `BERT_info_date()` will be saved in subfolders of the local cache: `/.info/` and `/.date/`, respectively.
- `FMAT_load()`.
- `library(FMAT)` now sets the following environment variables when the package is loaded:

  ```r
  Sys.setenv("HF_HUB_DISABLE_SYMLINKS_WARNING" = "1")
  Sys.setenv("TF_ENABLE_ONEDNN_OPTS" = "0")
  Sys.setenv("KMP_DUPLICATE_LIB_OK" = "TRUE")
  Sys.setenv("OMP_NUM_THREADS" = "1")
  ```
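These startup defaults favor stability (e.g., a single OpenMP thread). If desired, they can be overridden after loading the package with base R's `Sys.setenv()` — a minimal sketch (the value `4` is an arbitrary example):

```r
library(FMAT)                        # package startup sets OMP_NUM_THREADS to "1"
Sys.setenv("OMP_NUM_THREADS" = "4")  # override: allow 4 OpenMP threads
Sys.getenv("OMP_NUM_THREADS")        # returns "4"
```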
- `set_cache_folder()`: Set (change) the HuggingFace cache folder temporarily.
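Assuming `set_cache_folder()` accepts a folder path (a hypothetical call; check `?set_cache_folder` for the actual argument name), usage might look like:

```r
# Hypothetical usage sketch: point the HuggingFace cache at a custom
# location for the current session only.
library(FMAT)
set_cache_folder("D:/huggingface_cache")
```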
- `BERT_info_date()`: Scrape the initial commit date of BERT models from HuggingFace.
- `BERT_download()` and `BERT_info()`.
- `BERT_download()` connects to the Internet, while all the other functions run in an offline way.
- `BERT_info()`.
- `add.tokens` and `add.method` arguments for `BERT_vocab()` and `FMAT_run()`: An experimental functionality to add new tokens (e.g., out-of-vocabulary words, compound words, or even phrases) as [MASK] options. Validation is still needed for this novel practice (one of my ongoing projects), so currently please use it only at your own risk, waiting until the publication of my validation work.
- Models are now imported from local files only, without automatic downloading. Users must first use `BERT_download()` to download models.
- `FMAT_load()`: Better to use `FMAT_run()` directly.
- `BERT_vocab()` and `ICC_models()`.
- `summary.fmat()`, `FMAT_query()`, and `FMAT_run()` (significantly faster, because it can now simultaneously estimate all [MASK] options for each unique query sentence, with running time depending only on the number of unique queries rather than on the number of [MASK] options).
- If the reticulate package version is ≥ 1.36.1, then FMAT should be updated to ≥ 2024.4; otherwise, out-of-vocabulary [MASK] words may not be identified and marked. `FMAT_run()` now directly uses the model vocabulary and token IDs to match [MASK] words. To check whether a [MASK] word is in a model's vocabulary, please use `BERT_vocab()`.
- `BERT_download()` (downloading models to the local cache folder "%USERPROFILE%/.cache/huggingface"), to differentiate it from `FMAT_load()` (loading saved models from the local cache). Note, however, that `FMAT_load()` can also download models silently if they have not been downloaded.
- `gpu` argument (see Guidance for GPU Acceleration) in `FMAT_run()`: Allows specifying an NVIDIA GPU device on which the fill-mask pipeline will be allocated. GPU performs roughly 3x faster than CPU for the fill-mask pipeline. By default, `FMAT_run()` automatically detects and uses any available GPU if a CUDA-supported Python `torch` package is installed (otherwise, it uses the CPU).
- `FMAT_run()`.
- `BERT_download()`, `FMAT_load()`, and `FMAT_run()`.
- `parallel` in `FMAT_run()`: `FMAT_run(model.names, data, gpu=TRUE)` is the fastest.
- `progress` in `FMAT_run()`.
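Taken together, the entries above imply a download-then-run workflow: fetch models once with `BERT_download()`, then run everything else offline. A minimal end-to-end sketch (the model names and query are arbitrary examples, and the `FMAT_query()` argument names follow the package documentation — treat the details as assumptions and check the help pages for exact signatures):

```r
library(FMAT)

models = c("bert-base-uncased", "bert-base-cased")

# One-time, online step: download models into the local HuggingFace cache.
BERT_download(models)

# Offline from here on: inspect the cached models and run the fill-mask pipeline.
BERT_info(models)
query = FMAT_query("[MASK] is a nurse.", MASK = .(Male = "He", Female = "She"))
data = FMAT_run(models, query, gpu = TRUE)  # uses an NVIDIA GPU if available
summary(data)
```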