โ†Research
Researchvulnerability7 min read

LightRAG's Memgraph backend had a Cypher injection vulnerability hiding in plain sight

LightRAG's Memgraph storage backend interpolated unsanitised entity types directly into Cypher queries, enabling injection via the API. The Neo4j backend was already fixed.

LightRAG is a popular open-source retrieval-augmented generation framework with over 30,000 stars on GitHub. Its Memgraph storage backend had a Cypher injection vulnerability that allowed an unauthenticated attacker to execute arbitrary graph database operations through the API. The same vulnerability class had already been fixed in the Neo4j backend. The Memgraph backend was simply missed.

I identified the issue, wrote a sanitisation fix and submitted PR #2849. It was merged on 28 March 2026.

The vulnerability

The upsert_node() method in lightrag/kg/memgraph_impl.py constructs a Cypher query using an f-string that interpolates entity_type directly into a backtick-quoted label:

SET n:`{entity_type}`

Backtick quoting in Cypher is the equivalent of square-bracket quoting in SQL Server or double-quote quoting in PostgreSQL: it delimits an identifier. If the value inside the backticks contains a literal backtick, the identifier closes early, and everything after it is interpreted as Cypher syntax.
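The breakout is easy to demonstrate in a few lines of Python. This is an illustrative reduction of the vulnerable pattern, not LightRAG's exact code:

```python
def build_query(entity_type: str) -> str:
    # Vulnerable pattern: a user-controlled identifier interpolated
    # inside backticks via an f-string.
    return f"MERGE (n:`Entity` {{id: $id}})\nSET n:`{entity_type}`"

# Clean input stays inside the identifier quoting.
print(build_query("PERSON"))

# A backtick in the value closes the identifier early; everything after
# it is parsed as Cypher, and // comments out the trailing backtick.
print(build_query("x` SET n.admin = true //"))
```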

The entity_type value reaches this query through two paths:

  1. Direct API calls. The POST /graph/entity/create and POST /graph/entity/edit endpoints accept entity_type as a JSON field and pass it through rag.acreate_entity() and _edit_entity_impl() respectively, all the way down to upsert_node() without sanitisation.
  2. LLM extraction. When LightRAG processes documents, the LLM extracts entity types from text. The extraction pipeline runs sanitize_and_normalize_extracted_text(), which strips extra whitespace but does not filter backticks. This path is harder to exploit (it requires prompt injection into ingested documents) but cannot be ruled out.

The direct API path is the primary concern. An attacker who can reach the API can inject arbitrary Cypher by including a backtick in the entity type.

What exploitation looks like

A minimal proof of concept:

POST /graph/entity/create HTTP/1.1
Content-Type: application/json
 
{
  "entity_name": "test_entity",
  "entity_data": {
    "description": "A test entity",
    "entity_type": "x` SET n.admin = true //"
  }
}

The resulting Cypher, before the fix, evaluates as:

MERGE (n:`workspace` {entity_id: $entity_id})
SET n += $properties
SET n:`x` SET n.admin = true //`

The // comments out the trailing backtick. The injected SET n.admin = true executes as a standalone Cypher clause, manipulating arbitrary properties on the node.

Property manipulation is the gentlest outcome. A destructive payload like:

x` WITH n MATCH (m) DETACH DELETE m //

would delete every node in the database. The WITH n clause bridges out of the MERGE context, MATCH (m) selects all nodes and DETACH DELETE m removes them along with their relationships. The entire knowledge graph, gone in one request.
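Substituting that payload into the vulnerable query shape makes the damage concrete. The template below is a simplified stand-in for the f-string in upsert_node(), not the exact query text LightRAG builds:

```python
# Simplified stand-in for the vulnerable f-string in upsert_node().
template = (
    "MERGE (n:`workspace` {{entity_id: $entity_id}})\n"
    "SET n += $properties\n"
    "SET n:`{entity_type}`"
)
payload = "x` WITH n MATCH (m) DETACH DELETE m //"

# The rendered query's second half is entirely attacker-controlled Cypher.
print(template.format(entity_type=payload))
```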

Why the default deployment makes this worse

LightRAG's default configuration compounds the injection risk in three ways.

No authentication. The LIGHTRAG_API_KEY environment variable defaults to None. When unset, the API accepts all requests without any credential check. The authentication middleware exists but is opt-in, not opt-out.

Network exposure. The API server binds to 0.0.0.0 by default. The Docker Compose configuration exposes port 9621 directly, with no reverse proxy or service mesh in the path.

Unrestricted CORS. CORS origins default to "*", meaning any website can issue cross-origin requests to the API.

Taken together, this means a default LightRAG deployment with Memgraph is internet-facing, unauthenticated and reachable from any browser context. The injection does not require any special access. It requires an HTTP POST.
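A minimal hardening sketch for the deployment environment. LIGHTRAG_API_KEY is the variable named above; the host and CORS variable names here are illustrative placeholders, so check your deployment's actual configuration keys:

```shell
# Require an API key instead of accepting unauthenticated requests.
export LIGHTRAG_API_KEY="$(openssl rand -hex 32)"

# Hypothetical variable names: bind to loopback and restrict CORS
# rather than relying on the 0.0.0.0 / "*" defaults described above.
export HOST="127.0.0.1"
export CORS_ORIGINS="https://your-frontend.example.com"
```

Even with these set, port 9621 should sit behind a reverse proxy or network policy rather than being exposed directly.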

The fix that already existed next door

The Neo4j storage backend in neo4j_impl.py had the identical vulnerability at some point in its history. By the time I found the Memgraph issue, lines 1035 to 1054 of neo4j_impl.py on main already contained sanitisation logic: strip backticks, split comma-separated values, fall back to UNKNOWN on empty strings, log the modification.

The Memgraph backend was missing this defence entirely. The workspace_label parameter was already sanitised separately by _get_workspace_label() (which doubles backticks), but entity_type was the only remaining unsanitised user-controlled variable that reached a Cypher query.

This is a pattern worth naming: backend parity gaps. When a project supports multiple storage backends, a security fix applied to one backend does not automatically propagate to the others. Each backend gets its own implementation file, its own maintainer attention and, apparently, its own vulnerability timeline. The Neo4j fix made the Memgraph gap easier to find, but it also means the Memgraph backend was vulnerable for the entire period between the two fixes.

What the fix changes

The fix adds 21 lines of sanitisation to upsert_node() in memgraph_impl.py, positioned before the value is interpolated into the Cypher query:

# Coerce to str first so membership checks below never raise TypeError
entity_type = (
    str(entity_type) if not isinstance(entity_type, str) else entity_type
)
 
# Sanitize entity_type: strip backticks and handle comma-separated values.
if "`" in entity_type or "," in entity_type or not entity_type.strip():
    original = entity_type
    entity_type = entity_type.replace("`", "").strip()
    if "," in entity_type:
        entity_type = entity_type.split(",")[0].strip()
    if not entity_type:
        entity_type = "UNKNOWN"
    logger.warning(
        f"[{self.workspace}] Entity type sanitized in upsert_node: "
        f"'{original}' -> '{entity_type}'"
    )
    properties = dict(properties)
    properties["entity_type"] = entity_type

The logic handles four cases:

Input | Output | Effect
"PERSON" | "PERSON" | Clean input passes through unchanged
"x` SET n.admin=true //" | "x SET n.admin=true //" | Backtick stripped, injection neutralised
"PERSON, ORGANIZATION" | "PERSON" | Comma-separated LLM output reduced to first value
" " or "`" | "UNKNOWN" | Empty or backtick-only input gets a safe default

The type coercion at the top catches non-string values that might arrive from API payloads, preventing TypeError on the subsequent membership checks. The warning log provides audit traceability when sanitisation fires.

One detail worth noting: the fix also updates the properties dict so the sanitised value is persisted to the database, not just used in the query label. Without this, the stored entity_type property would still contain the unsanitised original, creating a discrepancy between the node's label and its recorded type.
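The four cases are easy to verify by lifting the sanitisation into a standalone function (logging and the properties update omitted for brevity):

```python
def sanitize_entity_type(entity_type) -> str:
    # Coerce to str so the membership checks below never raise TypeError.
    if not isinstance(entity_type, str):
        entity_type = str(entity_type)
    if "`" in entity_type or "," in entity_type or not entity_type.strip():
        entity_type = entity_type.replace("`", "").strip()
        if "," in entity_type:
            entity_type = entity_type.split(",")[0].strip()
        if not entity_type:
            entity_type = "UNKNOWN"
    return entity_type

assert sanitize_entity_type("PERSON") == "PERSON"
assert sanitize_entity_type("x` SET n.admin=true //") == "x SET n.admin=true //"
assert sanitize_entity_type("PERSON, ORGANIZATION") == "PERSON"
assert sanitize_entity_type(" ") == "UNKNOWN"
assert sanitize_entity_type("`") == "UNKNOWN"
```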

The recurring pattern in AI infrastructure

This is the third injection vulnerability I have found in AI framework infrastructure in 2026. The first was five SQL injection vectors in Hugging Face's skills framework, where DuckDB's multi-statement execution turned string formatting into near-arbitrary code execution. The pattern is consistent: a Python wrapper translates LLM output or user input into a query language, using string interpolation instead of parameterised queries or strict input validation.

RAG frameworks are particularly susceptible because they sit at the intersection of two trust boundaries. User input flows into LLM prompts, LLM output flows into database operations, and the framework treats both transitions as safe by default. The entity type in LightRAG is a good example: it originates either from a user-supplied JSON field or from an LLM's extraction of document text. Neither source should be trusted to produce syntactically safe Cypher labels, but the code assumed both would.

Graph databases add a wrinkle that relational databases do not have. In SQL, table and column names cannot be parameterised, but values can, and parameterised queries cover the vast majority of injection surface. In Cypher, labels are identifiers that also cannot be parameterised. Any label derived from user input must be validated or sanitised at the application layer. This is exactly what the fix does, and exactly what was missing.
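A stricter alternative is to reject rather than repair: validate any label against an identifier allowlist before it goes anywhere near a query. The pattern below is an illustrative choice (letters, digits and underscores, not starting with a digit), not what LightRAG ships:

```python
import re

# Conservative allowlist for labels that never need backtick quoting.
LABEL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def require_safe_label(label: str) -> str:
    # fullmatch ensures the whole string matches, not just a prefix.
    if not LABEL_RE.fullmatch(label):
        raise ValueError(f"unsafe Cypher label: {label!r}")
    return label
```

Rejection has the advantage of surfacing malformed LLM output or hostile API input immediately, instead of silently rewriting it into a different label.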

What defenders should check

If you run LightRAG with a Memgraph backend, update to any commit after the merge of PR #2849. If you run LightRAG with Neo4j, verify that your deployment includes the equivalent sanitisation (it should, if you are on a recent main).

Regardless of backend, review your deployment configuration. If LIGHTRAG_API_KEY is unset, your API is unauthenticated. If you have not placed a reverse proxy or network policy in front of port 9621, your knowledge graph is reachable from the internet. The Cypher injection was the sharp end of the problem, but the default network posture is what made it trivially exploitable.

The broader lesson is structural. Multi-backend projects need security review processes that treat each backend as a separate attack surface. A fix in neo4j_impl.py does not protect memgraph_impl.py. The code is different, the query construction is different, and the maintainer who patched one file may never look at the other. If your project supports multiple storage backends, every security fix should trigger a scan of every backend for the same class of issue. Otherwise you are just playing whack-a-mole with your own codebase.
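One structural countermeasure is a shared regression suite that every backend's sanitiser must pass. The sketch below uses a toy sanitiser as a stand-in for the real per-backend implementations in neo4j_impl.py and memgraph_impl.py:

```python
# Hostile inputs every backend sanitiser should neutralise.
PAYLOADS = ["x` SET n.admin = true //", "PERSON, ORGANIZATION", "`", "  "]

def check_sanitizers(sanitizers: dict) -> None:
    """Assert that no backend lets a backtick or empty label through."""
    for name, fn in sanitizers.items():
        for payload in PAYLOADS:
            label = fn(payload)
            assert "`" not in label, f"{name} leaks a backtick for {payload!r}"
            assert label.strip(), f"{name} yields an empty label for {payload!r}"

# Toy stand-in mirroring the merged fix's logic.
def toy_sanitize(value: str) -> str:
    value = value.replace("`", "").strip()
    if "," in value:
        value = value.split(",")[0].strip()
    return value or "UNKNOWN"

# Register one entry per storage backend; a fix in one file then
# fails loudly if the others regress.
check_sanitizers({"toy_backend": toy_sanitize})
```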
