
How LLMs Can Identify Anonymous Users at Scale: What Businesses Need to Know

LLMs can now identify anonymous users with 68% accuracy. Learn what businesses must do to protect user anonymity and mitigate deanonymization risks.


The End of Online Anonymity: LLMs Can Now Identify Users from Their Writing

Businesses that promise user anonymity face a new threat. Research from ETH Zurich and Anthropic demonstrates that large language models can identify anonymous users with up to 68% accuracy at 90% precision. What previously required hours of human investigation now takes minutes of automated processing.

Put simply: if your platform allows pseudonymous participation — from employee feedback systems to customer forums — that anonymity is no longer reliable.

Why This Matters for Your Business

Every platform handling anonymous data now carries elevated risk. According to research published on arXiv, LLMs can match users across platforms by analyzing just "a handful of comments." The attack costs $1-4 per attempt, making large-scale deanonymization economically viable for the first time.

Real numbers: When matching Hacker News profiles to LinkedIn accounts across 89,000 candidates, the LLM method achieved 45.1% recall at 99% precision — a 450-fold improvement over traditional methods. At internet scale with one million candidates, the system maintains approximately 35% recall at 90% precision.

For businesses, the implication is direct: any promise of anonymity made to users is now far harder to keep.

How the Attack Works

The deanonymization process uses a four-stage pipeline that appears benign to detection systems. As documented by Simon Lermen's research, the attack decomposes into seemingly innocent tasks:

  1. Extract: LLMs identify demographics, writing style, and interests from raw text
  2. Search: Features convert to semantic embeddings for efficient candidate matching
  3. Rank: LLMs reason over top candidates to verify matches
  4. Score: Final calibration reduces false positives
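The four stages above can be sketched in miniature. This is a hedged illustration, not the researchers' system: `extract_features` stands in for an LLM extraction call, and Jaccard word overlap stands in for semantic-embedding search, so the example runs with no API access. All names and data here are hypothetical.

```python
def extract_features(text: str) -> set[str]:
    """Stage 1 (Extract): stand-in for LLM feature extraction --
    here, just the distinctive lowercase words in the text."""
    stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "i"}
    return {w.strip(".,!?").lower() for w in text.split()} - stopwords

def similarity(a: set[str], b: set[str]) -> float:
    """Stage 2 (Search): Jaccard overlap as a crude stand-in
    for semantic-embedding distance."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_candidates(anon_text: str, candidates: dict[str, str], top_k: int = 3):
    """Stages 3-4 (Rank/Score): order candidate profiles by feature overlap."""
    anon = extract_features(anon_text)
    scored = [(name, similarity(anon, extract_features(bio)))
              for name, bio in candidates.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

candidates = {
    "alice": "Rust developer in Zurich who writes about compilers and climbing",
    "bob": "Marketing lead, posts about travel and food",
    "carol": "Compiler engineer, enjoys alpine climbing on weekends",
}
print(rank_candidates(
    "I spend weekends climbing and weekdays hacking on compilers",
    candidates))
```

The point of the sketch is the decomposition: each function is an unremarkable text-processing step, yet together they produce a ranked shortlist of likely identities.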

Honest take: The genius lies in the decomposition. Each step looks like legitimate data analysis — summarizing profiles, computing embeddings, ranking results. No single component triggers security alerts.

The system works directly on unstructured text across arbitrary platforms. Unlike the Netflix Prize deanonymization that required structured rating data, this approach needs only natural language content.

Real-World Impact We're Already Seeing

Researchers tested their system on actual anonymous data with concerning results: it identified 27% of anonymous scientist interviewees from public interview transcripts, and linked Reddit accounts across platforms despite year-long gaps between posts.

According to The Register's coverage, the automated approach replicates "in minutes what would take hours for a dedicated human investigator."

Your Options: Protection Strategies

Here's what we recommend based on the attack methodology:

1. Rate Limiting and Access Controls

The most effective short-term mitigation is restricting data access, since the attack depends on harvesting candidate text in bulk.

2. Data Minimization

Data minimization reduces the attack surface: the less text and metadata you store, the less there is to correlate.
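One concrete pattern is a field allowlist applied before anything is persisted. The field names below are purely illustrative, and a real pipeline would also consider retention windows, but the shape is the same: drop linkable metadata at write time.

```python
# Keep only the fields the feature actually needs; everything else is dropped
# before storage. Field names here are hypothetical examples.
ALLOWED_FIELDS = {"comment_id", "body", "created_at"}

def minimize(record: dict) -> dict:
    """Strip linkable metadata (IP, device, locale, ...) from a record."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {
    "comment_id": "c42",
    "body": "Great post!",
    "created_at": "2024-06-01T12:00:00Z",
    "ip": "203.0.113.7",          # linkable metadata -- dropped
    "user_agent": "Mozilla/5.0",  # linkable metadata -- dropped
}
print(minimize(raw))  # only comment_id, body, created_at survive
```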

3. Content Obfuscation

Content obfuscation makes pattern matching harder by disrupting the stylistic signals the attack relies on.
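As a toy illustration, surface features that stylometric matching keys on (capitalization habits, emphatic punctuation, spacing quirks) can be normalized before text is published. Real obfuscation would need paraphrasing or synonym substitution; this sketch only flattens the cheapest signals.

```python
import re

def normalize_style(text: str) -> str:
    """Flatten surface-level stylistic tells in user text (a sketch)."""
    text = text.lower()                       # erase capitalization habits
    text = re.sub(r"[!?]+", ".", text)        # flatten emphatic punctuation
    text = re.sub(r"\.{2,}", ".", text)       # collapse ellipses
    text = re.sub(r"\s+", " ", text).strip()  # uniform spacing
    return text

print(normalize_style("Wow!!  I  LOVED this... truly amazing!!!"))
```

Note the trade-off: aggressive normalization degrades the user's voice, which is part of why obfuscation alone is a weak defense.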

4. Platform Architecture Changes

Long-term protection requires fundamental shifts in how platforms store, expose, and link user-generated text.

The Hard Truth About Current Defenses

Traditional anonymization frameworks are "fundamentally inadequate" for this threat model, according to the research. Even LLM safety guardrails fail, because each stage of the attack looks like an innocuous analysis request; no single prompt reveals the deanonymization intent.

What this means for your project: If you're relying on username changes, IP masking, or basic anonymization — you're not protected. The attack analyzes writing style, topic interests, and linguistic patterns that persist across identities.

Key Takeaway for Business

The era of "security through obscurity" for online anonymity has ended. LLMs have reduced the cost of deanonymization from hours of skilled investigation to minutes of automated processing at $1-4 per target.

Businesses must assume that any pseudonymous content can potentially be traced back to its author. This isn't a future risk — the DAS (De-Anonymization at Scale) framework already identifies same-author content "from pools of thousands at rates well above chance."

Here's what we recommend:

  1. Audit immediately: Identify all systems handling anonymous or pseudonymous data
  2. Implement rate limiting: Deploy access controls before broader architectural changes
  3. Update privacy policies: Inform users that absolute anonymity cannot be guaranteed
  4. Design for the new reality: Build future systems assuming deanonymization capabilities will only improve

The practical obscurity that once protected anonymous online participation no longer exists. Plan accordingly.
