Why Your Keys Don’t Match

Gilles Colling

2026-03-31

A join runs without error but the row count is off – rows vanished, or they multiplied. The columns look fine. The key values look identical in the console. str() reveals nothing useful.

The problem is almost never the join logic. It is the strings. R’s merge() and dplyr’s *_join() compare key values byte-for-byte. When keys fail to match, they are genuinely different – the difference is just invisible: a trailing space, a case mismatch, a zero-width Unicode character with no visual representation.

This vignette walks through five scenarios where joins fail for string-level reasons that resist casual inspection. The data is synthetic; the patterns come from real pipelines.

Scenario 1: The Excel Export

A retail analytics team receives monthly sales data from a distribution partner as a CSV exported from Excel. They join it against their internal customer database on customer_id. For two quarters, everything works. Then one month, 30% of the sales records stop matching. Nobody changed the code or the customer database. The partner’s IDs are all present in the internal system – or so it appears.

partner_sales <- data.frame(
  customer_id = c("CUST-1001", "CUST-1002 ", "CUST-1003",
                  " CUST-1004", "CUST-1005 ", "CUST-1006"),
  amount = c(2500, 1800, 3200, 950, 4100, 1600),
  stringsAsFactors = FALSE
)

internal_db <- data.frame(
  customer_id = c("CUST-1001", "CUST-1002", "CUST-1003",
                  "CUST-1004", "CUST-1005", "CUST-1006", "CUST-1007"),
  region = c("West", "East", "West", "South", "East", "North", "West"),
  stringsAsFactors = FALSE
)

Nothing in str() or print() reveals the issue – trailing spaces are invisible in console output. join_spy() catches it:

report <- join_spy(partner_sales, internal_db, by = "customer_id")

Three of the six partner IDs carry whitespace. "CUST-1002 " is a different string from "CUST-1002" as far as R is concerned. join_repair() trims both tables at once:

repaired <- join_repair(partner_sales, internal_db, by = "customer_id")
#> ✔ Repaired 3 value(s)
partner_fixed <- repaired$x
internal_fixed <- repaired$y

We can verify the repair worked:

key_check(partner_fixed, internal_fixed, by = "customer_id")
#> ✔ Key check passed: no issues detected

And now the join gives us what we expected:

result <- merge(partner_fixed, internal_fixed, by = "customer_id")
nrow(result)
#> [1] 6

The root cause was an Excel CONCATENATE formula that preserved trailing spaces from a variable-width source column. Excel renders "CUST-1002" and "CUST-1002 " identically, so nobody noticed. Trailing whitespace is the single most common join failure we see in practice.

Scenario 2: Two Databases, Two Conventions

A SaaS company wants to join CRM profiles to clickstream events for a churn analysis. The CRM stores email addresses in uppercase (a database migration decision from the late 1990s). The web app stores them in lowercase. Both systems are internally consistent.

crm_profiles <- data.frame(
  email = c("ALICE@ACME.COM", "BOB@ACME.COM", "CAROL@ACME.COM",
            "DAVE@ACME.COM", "EVE@ACME.COM"),
  plan = c("enterprise", "starter", "pro", "enterprise", "starter"),
  stringsAsFactors = FALSE
)

click_events <- data.frame(
  email = c("alice@acme.com", "bob@acme.com", "carol@acme.com",
            "dave@acme.com", "frank@acme.com"),
  page_views = c(47, 12, 89, 33, 5),
  stringsAsFactors = FALSE
)

An inner join returns zero rows. R’s string comparison is case-sensitive, so every key pair fails. join_spy() flags the situation before the join:

report <- join_spy(crm_profiles, click_events, by = "email")

suggest_repairs() generates the fix:

suggest_repairs(report)
#> 
#> ── Suggested Repairs ───────────────────────────────────────────────────────────
#> # Standardize case:
#> x$email <- tolower(x$email)
#> y$email <- tolower(y$email)

Or we can use join_repair() directly, specifying case standardization:

repaired <- join_repair(
  crm_profiles, click_events,
  by = "email",
  standardize_case = "lower"
)
#> ✔ Repaired 5 value(s)

After lowercasing both sides, the inner join returns four matched rows (everyone except Eve, who has no click data, and Frank, who is not in the CRM):

result <- merge(repaired$x, repaired$y, by = "email")
nrow(result)
#> [1] 4
result
#>            email       plan page_views
#> 1 alice@acme.com enterprise         47
#> 2   bob@acme.com    starter         12
#> 3 carol@acme.com        pro         89
#> 4  dave@acme.com enterprise         33

Email addresses are case-insensitive by RFC 5321, so lowercasing is the right normalization here. For other identifier types (product codes, country abbreviations), "upper" may be more appropriate.

Scenario 3: The PDF Copy-Paste

A public health researcher compiles data from multiple sources for a systematic review. A few studies published supplementary tables only as PDF, so she copies the table from the PDF viewer, pastes into a spreadsheet, cleans up the columns, and reads the CSV into R. The data looks perfect – every country name is spelled correctly. But half the countries fail to match a reference population table.

# Simulating PDF copy-paste artifacts:
# \u00A0 is non-breaking space, \u200B is zero-width space
pdf_data <- data.frame(
  country = c("Brazil", "India\u200B", "Germany",
              "Japan\u00A0", "Canada", "France\u200B"),
  prevalence = c(12.3, 8.7, 5.1, 3.9, 6.2, 4.8),
  stringsAsFactors = FALSE
)

reference <- data.frame(
  country = c("Brazil", "India", "Germany", "Japan",
              "Canada", "France", "Italy"),
  population_m = c(214, 1408, 84, 125, 38, 68, 59),
  stringsAsFactors = FALSE
)

Printing the PDF data shows nothing wrong:

pdf_data$country
#> [1] "Brazil"  "India​"   "Germany" "Japan "  "Canada"  "France​"

The zero-width space after “India” and “France” occupies zero pixels. The non-breaking space after “Japan” renders like a regular space but is U+00A0, not U+0020 – trimws() will not always remove it. The merge shows the damage:

nrow(merge(pdf_data, reference, by = "country"))
#> [1] 3

Three of six countries match. join_spy() detects the invisible characters:

report <- join_spy(pdf_data, reference, by = "country")
repaired <- join_repair(pdf_data, reference, by = "country")
#> ✔ Repaired 3 value(s)
nrow(merge(repaired$x, repaired$y, by = "country"))
#> [1] 6

Six matches. PDF copy-paste is the most common source of these artifacts, but web scraping, OCR output, and legacy mainframe exports can produce them too. One useful debugging trick outside of joinspy: nchar("India\u200B") returns 6, not 5. But that requires already suspecting the problem.

Scenario 4: The Slowly Growing Mismatch

An e-commerce pipeline joins transaction records to a product catalogue. The pipeline ran cleanly for months, then match rates started drifting: 99% in January, 97% in February, 94% in March. Nobody noticed until finance flagged a margin discrepancy in April.

The code had not changed. A new data entry clerk had joined the warehouse team in December. The canonical product code format was "WDG-100" – uppercase prefix, dash, three-digit suffix. The new clerk sometimes omitted the dash, sometimes typed lowercase. The warehouse system did fuzzy matching internally, so it accepted the codes. The ETL join did not.

# Product catalogue (canonical format)
catalogue <- data.frame(
  product_code = c("WDG-100", "WDG-101", "WDG-102",
                   "WDG-103", "WDG-104", "WDG-105"),
  product_name = c("Widget A", "Widget B", "Widget C",
                   "Widget D", "Widget E", "Widget F"),
  margin = c(0.35, 0.28, 0.42, 0.31, 0.39, 0.25),
  stringsAsFactors = FALSE
)

# Recent transactions (mix of old and new clerk entries)
transactions <- data.frame(
  product_code = c("WDG-100", "WDG-101", "WDG102",
                   "wdg-103", "WDG-104", "wdg105",
                   "WDG-100", "WDG103"),
  quantity = c(5, 3, 7, 2, 4, 6, 1, 8),
  stringsAsFactors = FALSE
)

The damage is partial – some codes match, some do not – which makes it harder to spot than a complete failure.

report <- join_spy(transactions, catalogue, by = "product_code")

Here is where this scenario differs from the previous ones. join_repair() can fix the case issue, but it cannot insert the missing dashes – that requires domain knowledge about the code format.

We can do a dry run to see what join_repair() would fix:

join_repair(transactions, catalogue,
            by = "product_code",
            standardize_case = "upper",
            dry_run = TRUE)
#> 
#> ── Repair Preview (Dry Run) ────────────────────────────────────────────────────
#> 
#> ── Left table (x) ──
#> 
#> ℹ product_code: upper case (2)

After applying the mechanical fixes:

repaired <- join_repair(transactions, catalogue,
                        by = "product_code",
                        standardize_case = "upper")
#> ✔ Repaired 2 value(s)

The case issues are resolved, but the missing dashes remain. A manual transformation handles those:

# Manual fix: insert dash if missing in product codes matching the pattern
fix_codes <- function(codes) {
  gsub("^([A-Z]{3})(\\d)", "\\1-\\2", codes)
}
repaired$x$product_code <- fix_codes(repaired$x$product_code)
result <- merge(repaired$x, repaired$y, by = "product_code")
nrow(result)
#> [1] 8

All eight transactions match. join_repair() handles context-free transformations (trimming, case normalization, stripping invisible characters). Inserting a dash into "WDG102" requires knowing the canonical format – that fix has to come from someone who understands the data.

Scenario 5: The Compound Key Trap

Two government datasets need to be linked: regional economic indicators and regional population estimates, keyed on region and year. The year column is numeric and matches without trouble. The region column has a whitespace problem that affects only some records.

economics <- data.frame(
  region = c("North America", "Europe", "Asia Pacific ",
             "North America", "Europe", "Asia Pacific "),
  year = c(2022, 2022, 2022, 2023, 2023, 2023),
  gdp_growth = c(2.1, 1.8, 4.2, 1.9, 0.9, 3.8),
  stringsAsFactors = FALSE
)

population <- data.frame(
  region = c("North America", "Europe", "Asia Pacific",
             "North America", "Europe", "Asia Pacific"),
  year = c(2022, 2022, 2022, 2023, 2023, 2023),
  pop_millions = c(580, 450, 4300, 585, 448, 4350),
  stringsAsFactors = FALSE
)

In a compound key join, both columns must match. The year column is fine everywhere, but "Asia Pacific " with a trailing space will not match "Asia Pacific".

merged <- merge(economics, population, by = c("region", "year"))
nrow(merged)
#> [1] 4

Four rows instead of six. North America and Europe link correctly; Asia Pacific silently vanishes. Two-thirds of the data is present – just enough to look plausible.

join_spy() pinpoints which column in the compound key has the problem:

report <- join_spy(economics, population, by = c("region", "year"))
repaired <- join_repair(economics, population, by = c("region", "year"))
#> ✔ Repaired 2 value(s)
result <- merge(repaired$x, repaired$y, by = c("region", "year"))
nrow(result)
#> [1] 6

Six rows. Compound keys are more fragile than single keys because the failure is partial: one clean column and one dirty column produces a result that looks reasonable but is quietly incomplete. The more columns in the key, the more places an invisible character can hide.

The Pattern

These five scenarios share three properties. The data looks correct to standard inspection tools – str(), summary(), print() all render the values identically. The failure is silent – R does not warn when a merge drops rows because of invisible whitespace; it just returns fewer rows. And the fix is mechanical once the cause is known – trimming, lowercasing, or stripping invisible Unicode are all one-line operations.

join_spy() closes the gap between a thirty-second fix and the hours of confusion that precede it, particularly with data from external sources, manual entry, PDF extraction, or cross-system integrations.

mirror server hosted at Truenetwork, Russian Federation.