semanticfa performs exploratory factor analysis on language model embeddings of psychological scale items. Given item text, it embeds each item, computes a similarity matrix, and extracts latent factors — entirely from the text, with no human response data required.
The package is designed to feel familiar to psych and EFAtools users.
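The embed, similarity, and extraction steps described above can be illustrated in base R without the package. This is only a minimal sketch under stated assumptions: the embeddings are made-up toy data, and eigendecomposition followed by `stats::varimax()` stands in for the package's own extraction and rotation, which may differ.

```r
# Toy stand-in for item embeddings (made-up data): 6 items in 4 dimensions,
# built as two clusters so the expected structure is known in advance.
set.seed(1)
centers <- rbind(c(1, 0, 0, 0), c(0, 1, 0, 0))
cluster <- rep(1:2, each = 3)
emb <- centers[cluster, ] + matrix(rnorm(6 * 4, sd = 0.1), nrow = 6)
rownames(emb) <- paste0("item_", 1:6)

# Step 1: cosine similarity between items (rows scaled to unit length).
unit <- emb / sqrt(rowSums(emb^2))
sim <- tcrossprod(unit)

# Step 2: treat the similarity matrix like a correlation matrix and take
# two components from its eigendecomposition (an assumption for this sketch).
eig <- eigen(sim, symmetric = TRUE)
loadings <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))
rownames(loadings) <- rownames(emb)

# Step 3: rotate towards simple structure and assign each item to the
# component on which it has the largest absolute loading.
rotated <- unclass(varimax(loadings)$loadings)
assignment <- apply(abs(rotated), 1, which.max)
```

In this toy setup the first three items should land on one component and the last three on the other, which is the kind of structure the loadings table is meant to reveal for real item text.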
The package ships with the 50-item IPIP Big Five inventory and precomputed sentence-BERT embeddings, so you can try it with zero setup:
library(semanticfa)
data(big5)
fit <- sfa(
  big5$items,
  nfactors = 5,
  embeddings = big5$embeddings,
  scoring = big5$scoring
)
print(fit)
#> Semantic Factor Analysis
#> Encoding: atomic
#> Embedding dim: 384
#> Factors: 5 (minres + oblimin)
#>
#> Diagnostics:
#> KMO: 0.866 (meritorious - higher is better)
#> TEFI: -47.0724 (lower is better)
#> RMSR: 0.0556 (acceptable - lower is better)
#> CAF: 0.4880 (marginal - higher is better)
#>
#> Factor loadings:
#>
#> Loadings:
#> MR1 MR4 MR3 MR5 MR2
#> item_39 0.617
#> item_35 0.598
#> item_34 0.561
#> item_05 0.531
#> item_32 0.524
#> item_49 0.509
#> item_04 0.492
#> item_28 0.485
#> item_38 0.444
#> item_12 0.438 0.307
#> item_07 0.425
#> item_31 0.379
#> item_02 0.372 0.338
#> item_01 0.366
#> item_33 0.365
#> item_36 0.343
#> item_13
#> item_37
#> item_16 0.601
#> item_48 0.548 0.307
#> item_24 0.530 0.315
#> item_21 0.493 0.417
#> item_23 0.467
#> item_30 0.464
#> item_47 0.444 0.347
#> item_29 0.413 0.351
#> item_15 0.411
#> item_19 0.383 0.377
#> item_06 0.319
#> item_27 0.888
#> item_25 0.708
#> item_08 0.498
#> item_22 0.480
#> item_09 0.418
#> item_10 0.361 0.400
#> item_03 0.366 0.383
#> item_20 0.786
#> item_14 0.686
#> item_18 0.662
#> item_11 0.498
#> item_17 0.485
#> item_26
#> item_44 0.362 0.686
#> item_42 0.672
#> item_50 0.649
#> item_45 0.603
#> item_43 0.528
#> item_46 0.512
#> item_41 0.327 0.390
#> item_40
#>
#> MR1 MR4 MR3 MR5 MR2
#> SS loadings 4.762 3.180 3.289 2.955 3.149
#> Proportion Var 0.095 0.064 0.066 0.059 0.063
#> Cumulative Var 0.095 0.159 0.225 0.284 0.347
#>
#> Factor correlations (Phi):
#> MR1 MR4 MR3 MR5 MR2
#> MR1 1.000 0.381 0.227 0.337 0.276
#> MR4 0.381 1.000 0.336 0.358 0.129
#> MR3 0.227 0.336 1.000 0.202 0.183
#> MR5 0.337 0.358 0.202 1.000 0.052
#> MR2 0.276 0.129 0.183 0.052 1.000
#>
#> Variance accounted for:
#> MR1 MR4 MR3 MR5 MR2
#> SS loadings 5.968 4.208 3.880 3.572 3.524
#> Proportion Var 0.119 0.084 0.078 0.071 0.070
#> Cumulative Var 0.119 0.204 0.281 0.353 0.423
#> Proportion Explained 0.282 0.199 0.183 0.169 0.167
#> Cumulative Proportion 0.282 0.481 0.665 0.833 1.000
When you omit nfactors, sfa() uses embedding-adapted parallel analysis (random unit vectors in the embedding dimension as the null):
fit_auto <- sfa(
  big5$items,
  embeddings = big5$embeddings,
  scoring = big5$scoring
)
cat("Auto-detected factors:", fit_auto$factors, "\n")
#> Auto-detected factors: 8
For a multi-method comparison, use sfa_nfactors():
sim <- sfa_similarity(big5$embeddings, encoding = "atomic_reversed",
                      scoring = big5$scoring)
nf <- sfa_nfactors(sim, big5$embeddings,
                   methods = c("parallel", "kaiser"),
                   parallel_iter = 50)
print(nf)
#> Factor retention analysis (embedding-adapted)
#>
#> Method n_factors
#> parallel 8
#> kaiser 13
#> ------------------------
#> Consensus 8
#>
#> Eigenvalues: 13.5 3.5 2.9 2.1 1.8 1.7 1.6 1.5 1.3 1.2 ...
The encoding argument controls how embeddings become a similarity matrix:
sim_ar <- sfa_similarity(big5$embeddings, "atomic_reversed", big5$scoring)
sim_sq <- sfa_similarity(big5$embeddings, "squid", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "squid": this method is keying-free by design. Use "atomic_reversed" for keyed
#> sign-flipping.
sim_mcp <- sfa_similarity(big5$embeddings, "mean_centered_pearson", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "mean_centered_pearson": this method is keying-free by design. Use
#> "atomic_reversed" for keyed sign-flipping.
cat("atomic_reversed range:", range(sim_ar[lower.tri(sim_ar)]), "\n")
#> atomic_reversed range: -0.8933168 0.7684432
cat("squid range: ", range(sim_sq[lower.tri(sim_sq)]), "\n")
#> squid range: -0.3360389 0.8676635
cat("mean_centered_pearson:", range(sim_mcp[lower.tri(sim_mcp)]), "\n")
#> mean_centered_pearson: -0.02542094 0.8932819
SQuID and mean-centered Pearson recover negative correlations between reverse-keyed dimensions; atomic_reversed does not.
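To make the idea of an encoding concrete, here is a rough base R sketch of two ways embeddings could become a similarity matrix. The definitions are assumptions inferred from the encoding names and the warning text ("keyed sign-flipping"), not the package's actual implementation; `emb` and `keys` are hypothetical toy inputs, and SQuID is left out because its definition is not given here.

```r
set.seed(42)
emb <- matrix(rnorm(5 * 8), nrow = 5)  # 5 hypothetical items, 8 dimensions
keys <- c(1, 1, -1, 1, -1)             # hypothetical keying: -1 = reverse-keyed

# Cosine similarity between the rows of a matrix.
cosine_sim <- function(x) {
  unit <- x / sqrt(rowSums(x^2))
  tcrossprod(unit)
}

# Keyed sign-flipping (assumed): flip the sign of similarities between items
# of opposite keying. This equals negating each reverse-keyed embedding.
sim_flipped <- cosine_sim(emb) * outer(keys, keys)

# Mean-centered Pearson (assumed): remove the mean embedding shared by all
# items, then correlate items across the embedding dimensions. No keys used.
centered <- sweep(emb, 2, colMeans(emb))
sim_pearson <- cor(t(centered))
```

The first variant needs the scoring key, while the second is keying-free, which mirrors the distinction the warnings draw between "atomic_reversed" and the keying-free encodings.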
plot(fit, type = "scree")
Figure: Scree plot with parallel analysis threshold
plot(fit, type = "loadings")
Figure: Factor loading heatmap
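The parallel-analysis threshold shown on the scree plot is described earlier as using random unit vectors in the embedding dimension as the null. A minimal base R sketch of that idea follows. The function names `parallel_threshold()` and `count_factors()`, the 95th-percentile cutoff, and the stopping rule are illustrative assumptions, not the package's internals.

```r
# Eigenvalue thresholds from a null of random unit vectors: for each rank,
# take a quantile of the eigenvalues of random cosine-similarity matrices.
parallel_threshold <- function(n_items, dim, n_iter = 50, prob = 0.95) {
  null_eigs <- replicate(n_iter, {
    z <- matrix(rnorm(n_items * dim), nrow = n_items)
    unit <- z / sqrt(rowSums(z^2))  # random points on the unit sphere
    eigen(tcrossprod(unit), symmetric = TRUE, only.values = TRUE)$values
  })
  apply(null_eigs, 1, quantile, probs = prob)  # one threshold per rank
}

# Count the leading observed eigenvalues that exceed the null thresholds,
# stopping at the first one that does not.
count_factors <- function(sim, dim, ...) {
  observed <- eigen(sim, symmetric = TRUE, only.values = TRUE)$values
  above <- observed > parallel_threshold(nrow(sim), dim, ...)
  if (all(above)) length(above) else which(!above)[1] - 1
}

# Toy check on made-up data: 10 items in 16 dimensions forming two clusters.
set.seed(123)
dim_emb <- 16
centers <- diag(dim_emb)[1:2, ]
emb <- centers[rep(1:2, each = 5), ] +
  matrix(rnorm(10 * dim_emb, sd = 0.05), nrow = 10)
sim <- tcrossprod(emb / sqrt(rowSums(emb^2)))
n_factors <- count_factors(sim, dim_emb, n_iter = 50)
```

Using unit vectors in the embedding dimension, rather than random response data, keeps the null on the same scale as cosine similarities between real embeddings.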
The $loadings component works directly with psych functions:
# Run human-data EFA (not run — requires response data)
human_fit <- psych::fa(response_data, nfactors = 5, rotate = "oblimin")
# Compare
psych::factor.congruence(fit$loadings, human_fit$loadings)
For NMI, ARI, Frobenius, and disattenuated correlation:
cong <- sfa_congruence(fit, big5$factors, metrics = c("nmi", "ari"))
print(cong)
#> Factor structure congruence
#>
#> NMI: 0.428 (weak - higher is better)
#> ARI: 0.257 (poor - higher is better)
Pass any embedding model’s output via embeddings=:
# With sentence-transformers (requires reticulate + Python).
# The default model is "Qwen/Qwen3-Embedding-0.6B"; larger models such as
# "Qwen/Qwen3-Embedding-4B" (8 GB RAM) or "Qwen/Qwen3-Embedding-8B" (16 GB RAM)
# recover factor structure more accurately.
emb <- sfa_embed(my_items, embed = "sbert", model = "Qwen/Qwen3-Embedding-0.6B")
fit <- sfa(my_items, embeddings = emb, scoring = my_scoring)
# Or bring your own function
my_embedder <- function(texts) {
  # ... your embedding logic ...
  # must return a numeric matrix (n_items x dim)
}
fit <- sfa(my_items, embed = my_embedder, scoring = my_scoring)
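As a concrete illustration of the custom-function contract above (character vector in, numeric matrix of n_items x dim out), here is a toy embedder written in base R. The name `toy_embedder()` and its word-hashing scheme are hypothetical and far too crude for real analyses; the sketch only shows the expected input and output shape.

```r
# Toy embedder (hypothetical): hashes each word into one of `dim` columns and
# counts occurrences. Returns a numeric matrix with one row per input text.
toy_embedder <- function(texts, dim = 32) {
  out <- matrix(0, nrow = length(texts), ncol = dim)
  for (i in seq_along(texts)) {
    words <- strsplit(tolower(texts[i]), "[^a-z]+")[[1]]
    words <- words[nzchar(words)]
    for (w in words) {
      # Simple position-weighted character-code hash, mapped to 1..dim.
      col <- sum(utf8ToInt(w) * seq_len(nchar(w))) %% dim + 1
      out[i, col] <- out[i, col] + 1
    }
  }
  out
}

# Example item texts, used only to check the output shape.
items <- c(
  "Am the life of the party.",
  "Don't talk a lot.",
  "Feel comfortable around people."
)
emb <- toy_embedder(items)
dim(emb)
#> [1] 3 32
```

A function with this shape could then be passed as `embed = toy_embedder` in the same way as `my_embedder` above, assuming sfa() calls it with the item texts as its only required argument.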