1. The Backstory: The $4,500 OpenAI API Bill That Squeezed a Startup's Margins

At CodexiLab, we love building AI solutions, but we are equally obsessed with keeping them cost-effective and performant. Last quarter, a customer-support SaaS startup came to us with a serious financial challenge. They had built an AI agent to handle customer queries for online e-commerce stores. The agent was powered by OpenAI's GPT-4o model and was highly successful, resolving over 85% of incoming support tickets without human intervention.

However, as their query volume grew to over 150,000 queries per month, their OpenAI API bill skyrocketed to over $4,500. Every single support request required sending a large prompt context (containing store FAQs, shipping guidelines, and order policies) along with the user's question, consuming substantial input tokens. Furthermore, because users frequently asked the same questions in slightly different ways (e.g. 'Where is my order?' vs. 'Track my package' vs. 'I haven't received my items yet'), the system was paying to process the exact same response logic repeatedly, resulting in high latency and astronomical operating costs.

We realized that traditional exact-match caching was of little use here. We needed a system that could understand the semantic meaning of questions and serve cached responses for conceptually identical queries, regardless of how they were phrased. This technical post details the design and implementation of a production-grade LLM semantic caching layer using Redis HNSW vector search, which cut our client's API costs by roughly 80% and dropped response times for cached queries to about 15 milliseconds.

2. The Limitation of Traditional Caching in GenAI Applications

In traditional web development, caching is simple. You hash a request URL, a database query, or an API request, and store the result in a key-value store like Memcached or Redis. If a matching key is requested, you serve the value instantly. This works because standard web queries are deterministic: a request for GET /products/123 is always identical.

In Generative AI, this pattern breaks down. Natural language is open-ended: the same intent can be phrased in countless ways. A user can ask for shipping information like this:

  • 'How long does shipping take?'
  • 'What is the delivery time?'
  • 'When will my package arrive?'
  • 'Shipping speed?'

To a traditional cache, these are four completely distinct strings, resulting in cache misses. Yet, they all share the exact same intent and expect the same answer. If we run these four queries through an LLM, we waste tokens generating the same response four times. To solve this, we must build a semantic cache. Instead of caching based on string equality, we cache based on semantic similarity in a high-dimensional vector space.
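To make the failure concrete, here is a minimal sketch of an exact-match cache applied to the four phrasings above. The helper name `exact_match_key` and the placeholder cached value are ours for illustration; the point is only that hashing the string, even after normalizing case and whitespace, produces a hit for one phrasing and misses for the other three.

```python
import hashlib

def exact_match_key(text: str) -> str:
    """Key a conventional cache would use: a hash of the normalized string."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

paraphrases = [
    "How long does shipping take?",
    "What is the delivery time?",
    "When will my package arrive?",
    "Shipping speed?",
]

# Only the first phrasing has been seen (and cached) before.
cache = {exact_match_key(paraphrases[0]): "<cached shipping answer>"}

# Every other phrasing hashes to a different key, so it misses.
hits = [exact_match_key(q) in cache for q in paraphrases]
print(hits)  # [True, False, False, False]
```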

3. The Core Concept: Vector Embeddings and Cosine Similarity Thresholds

To implement a semantic cache, we use vector embeddings. An embedding model (such as OpenAI's text-embedding-3-small or a local HuggingFace sentence-transformer) maps natural language text into a dense vector representation. Text blocks with similar semantic meanings are placed close together in this high-dimensional vector space.

When a user query enters our system, we follow this process:

  1. Generate an embedding vector for the incoming query.
  2. Search a vector index (in our case, hosted in Redis) for previously cached queries whose vectors are close to the new query vector.
  3. Calculate the distance (or similarity) between the new query and the closest cached query. We use Cosine Similarity as our metric: a score of 1.0 represents identical vectors, while lower scores represent increasing conceptual distance.
  4. If the cosine similarity exceeds a specific threshold (e.g. 0.96), we consider it a 'cache hit' and return the cached LLM response. If the similarity is below the threshold, we consider it a 'cache miss', send the query to the LLM, and then cache the query-response pair for future use.
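The four steps above can be sketched without any infrastructure at all. The following is a simplified in-memory version that scans a plain Python list instead of a Redis index; the function names and the toy 3-dimensional vectors are assumptions for illustration, and real embeddings would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(query_vec, cache_entries, threshold=0.96):
    """Return the response of the closest cached entry if it clears the
    threshold, otherwise None (a cache miss).

    cache_entries is a list of (vector, response) pairs.
    """
    best_score, best_response = -1.0, None
    for vec, response in cache_entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

# Toy vectors standing in for real embeddings.
entries = [
    ([1.0, 0.0, 0.0], "<shipping answer>"),
    ([0.0, 1.0, 0.0], "<returns answer>"),
]
print(lookup([0.99, 0.05, 0.0], entries))  # close to the first entry: hit
print(lookup([0.70, 0.70, 0.0], entries))  # between both entries: miss
```

A linear scan like this is only practical for very small caches, which is why the rest of this post moves the nearest-neighbor search into a vector index.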

Selecting the right similarity threshold is a delicate engineering challenge. If the threshold is too high (e.g. 0.99), the cache hit rate will be very low, defeating the purpose of the cache. If the threshold is too low (e.g. 0.88), the cache will trigger 'false hits', serving responses to questions that are conceptually different. For example, a query about 'exchanging an item' might retrieve a cached response for 'returning an item', which has different store policy rules. Through extensive testing on our client's dataset, we found that a similarity threshold of 0.95 to 0.96 gave the best balance between safety and hit rate. The right value depends on the embedding model and the data, so it should be re-measured for each deployment.
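One way to run this kind of test is to label a sample of query pairs as "same intent" or "different intent", record their similarity scores, and sweep candidate thresholds. The sketch below assumes such a labeled sample exists; the numbers in `labeled_pairs` are made up for illustration and are not our client's data, and `sweep_thresholds` is a hypothetical helper name.

```python
def sweep_thresholds(labeled_pairs, thresholds):
    """For each threshold, report the share of pairs that would be served
    from cache and the share of those served pairs that are false hits.

    labeled_pairs is a list of (similarity, same_intent) tuples.
    """
    rows = []
    for t in thresholds:
        served = [same for sim, same in labeled_pairs if sim >= t]
        hit_rate = len(served) / len(labeled_pairs)
        false_hits = sum(1 for same in served if not same)
        false_hit_rate = false_hits / len(served) if served else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows

# Made-up sample: (cosine similarity, do both queries share the same intent?)
labeled_pairs = [
    (0.99, True), (0.97, True), (0.96, True), (0.94, True),
    (0.93, False), (0.90, False), (0.89, True), (0.85, False),
]

for t, hit_rate, false_hit_rate in sweep_thresholds(labeled_pairs, [0.88, 0.95, 0.99]):
    print(f"threshold={t:.2f} hit_rate={hit_rate:.3f} false_hit_rate={false_hit_rate:.3f}")
```

On this toy sample the lowest threshold serves the most queries but lets false hits through, while the highest threshold is safe but rarely hits, which is the trade-off described above.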

python
import redis
import numpy as np
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

# Establish Redis Connection
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Vector dimensions (e.g. 384 for sentence-transformers, 1536 for OpenAI)
VECTOR_DIM = 384
INDEX_NAME = "semantic_cache"

def initialize_redis_vector_index():
    """
    Initializes an HNSW Vector Index in Redis for semantic caching.
    """
    try:
        r.ft(INDEX_NAME).info()
        print("Index already exists.")
    except redis.exceptions.ResponseError:
        # Define HNSW (Hierarchical Navigable Small World) index parameters
        schema = (
            TextField("query"),
            TextField("response"),
            VectorField("query_vector", "HNSW", {
                "TYPE": "FLOAT32",
                "DIM": VECTOR_DIM,
                "DISTANCE_METRIC": "COSINE",
                "INITIAL_CAP": 10000,
                "M": 16,
                "EF_CONSTRUCTION": 200,
                "EF_RUNTIME": 50
            })
        )
        r.ft(INDEX_NAME).create_index(
            fields=schema,
            definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
        )
        print("HNSW Vector Index created successfully.")

4. Step-by-Step Implementation of the Semantic Cache using Redis HNSW

The code block above shows how to initialize an HNSW (Hierarchical Navigable Small World) vector index in Redis using the search commands that ship with Redis Stack. HNSW is a graph-based algorithm for approximate nearest neighbor search. Because it is approximate, it trades a small amount of recall for speed, and lookups typically stay fast (on the order of a millisecond or a few milliseconds, depending on hardware and index parameters) even as the cache grows to millions of vectors.

Once the index is initialized, we implement the search and cache insertion flows. During a search, we convert the user's query into a vector, format the raw bytes, and execute a Redis FT.SEARCH query. If a match is found within our cosine similarity threshold, we return the cached response, logging a cache hit. Otherwise, we execute the LLM call, package the query, response, and raw query vector bytes, and upsert them to Redis under a unique key.

python
def search_semantic_cache(query_vector: np.ndarray, similarity_threshold: float = 0.95) -> str | None:
    """
    Queries the Redis HNSW index for a semantic cache hit.
    """
    # Convert numpy float32 array to raw byte string
    query_vector_bytes = query_vector.astype(np.float32).tobytes()
    
    # Construct vector query: find 1 nearest neighbor (K=1).
    # KNN syntax with query parameters requires query dialect 2 or later.
    query_str = "*=>[KNN 1 @query_vector $vector AS distance]"
    q = (
        Query(query_str)
        .sort_by("distance")
        .paging(0, 1)
        .return_fields("response", "query", "distance")
        .dialect(2)
    )
    
    # Execute search in Redis
    results = r.ft(INDEX_NAME).search(q, query_params={"vector": query_vector_bytes})
    
    if results.docs:
        match = results.docs[0]
        # Cosine distance in Redis = 1 - Cosine Similarity
        distance = float(match.distance)
        similarity = 1.0 - distance
        
        if similarity >= similarity_threshold:
            print(f"[Semantic Hit] Match Score: {similarity:.4f}")
            return match.response
            
    return None

def insert_into_cache(query_text: str, response_text: str, query_vector: np.ndarray):
    """
    Inserts a query-response pair with its vector into Redis.
    """
    key = f"cache:{r.incr('cache_id_seq')}"
    r.hset(key, mapping={
        "query": query_text,
        "response": response_text,
        "query_vector": query_vector.astype(np.float32).tobytes()
    })
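
To show how the search and insertion flows fit together on a request, here is a minimal orchestration sketch. It takes the cache functions and the model calls as parameters so it can be exercised without Redis or an API key; `embed` and `call_llm` are hypothetical callables standing in for your embedding model and LLM client, while `search` and `insert` can be `search_semantic_cache` and `insert_into_cache` from above.

```python
from typing import Callable, Optional, Sequence

def answer_with_cache(
    query_text: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], Optional[str]],
    insert: Callable[[str, str, Sequence[float]], None],
    call_llm: Callable[[str], str],
) -> tuple[str, bool]:
    """Return (response, was_cache_hit) for a single user query."""
    query_vector = embed(query_text)

    # Cache hit: skip the LLM entirely.
    cached = search(query_vector)
    if cached is not None:
        return cached, True

    # Cache miss: pay for one LLM call, then store the pair for next time.
    response = call_llm(query_text)
    insert(query_text, response, query_vector)
    return response, False
```

Passing the dependencies in also makes it straightforward to swap the embedding model or add a per-namespace cache later without touching this function.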

5. Cache Eviction, TTL, and Handling Vector Index Bloat

A production semantic cache cannot grow indefinitely; it requires a cache eviction policy to prevent memory exhaustion. Fortunately, standard Redis TTL (Time-To-Live) commands work as expected on hash keys that belong to a search index. If a hash key expires and is deleted by Redis, the index updates automatically, which is exactly what we want. We set a TTL of 7 days on our cache keys (for example, by calling EXPIRE on each key right after it is written), ensuring that outdated data is evicted and memory is freed.
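As a sketch of how the 7-day TTL could be attached at write time, the variant below writes the hash and sets its expiry in one pipeline so a key is never left without a TTL. The function name `insert_with_ttl` is ours, and `client` is assumed to be a redis-py client (anything exposing `incr` and `pipeline()` with `hset`, `expire`, and `execute`). It uses the standard-library `array` module to pack float32 bytes so the example does not depend on NumPy.

```python
from array import array

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days

def insert_with_ttl(client, query_text, response_text, query_vector,
                    ttl_seconds=CACHE_TTL_SECONDS):
    """Store a query-response pair under a 'cache:' key that expires.

    query_vector is any sequence of floats; it is packed as 32-bit floats
    in native byte order, matching a FLOAT32 vector field.
    """
    key = f"cache:{client.incr('cache_id_seq')}"
    vector_bytes = array("f", query_vector).tobytes()

    # Queue both commands and send them together.
    pipe = client.pipeline()
    pipe.hset(key, mapping={
        "query": query_text,
        "response": response_text,
        "query_vector": vector_bytes,
    })
    pipe.expire(key, ttl_seconds)
    pipe.execute()
    return key
```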

Additionally, we must handle the problem of 'cache drift'. If a business changes its store policies (e.g. shipping time changes from 3 days to 5 days), any existing cached responses about shipping times will now contain outdated, incorrect information. To resolve this, we implement a namespace-invalidation mechanism. When an admin updates an FAQ or store policy document, our system calculates which semantic entities are affected and sends an invalidation command. This command deletes all cached vectors within that specific namespace, ensuring that the AI agent immediately fetches fresh data on its next query.
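A minimal sketch of the deletion step might look like the following. It assumes a key layout of `cache:{namespace}:{id}` (for example `cache:shipping:42`), which is an illustrative convention rather than the layout used in the code earlier in this post, and it assumes namespace names are simple strings without glob characters. Deciding which namespaces a policy change affects is application-specific and is not shown. `client` is assumed to be a redis-py client, which provides `scan_iter` and `delete`.

```python
def invalidate_namespace(client, namespace: str, batch_size: int = 500) -> int:
    """Delete every cache key in one namespace and return how many were removed.

    SCAN is used instead of KEYS so the server is not blocked while
    iterating over a large keyspace.
    """
    pattern = f"cache:{namespace}:*"
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

Because these keys still start with `cache:`, they remain covered by the index prefix, and deleting them removes the corresponding entries from the search index as well.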

6. Security Best Practices: Preventing Cache Poisoning and Prompt Injection

Semantic caching introduces a security vulnerability known as 'cache poisoning'. A malicious user can attempt to manipulate the cache by asking a question that is semantically similar to a high-privilege query but leads to an incorrect or malicious response. For example, an attacker might phrase 'What is the admin password reset link?' so that its embedding lands close to 'What is the user password reset link?', causing the answer cached for one question to be served for the other.

To mitigate cache poisoning, we apply three strict guardrails:

  1. Categorical Scoping: We never use a single global cache index. Instead, we partition the cache by intent and user privilege level. User support queries are searched only within the public_support_cache namespace. Internal admin queries use a separate, cryptographically isolated index.

  2. Sanitize Queries Before Embedding: We run a lightweight regex and string cleaner on all queries before passing them to the embedding model. This strips out random punctuation, duplicate characters, and common prompt injection payloads, ensuring that the computed vector represents the clean semantic intent rather than the injection wrapper.

  3. Automated Cache Validation Audits: We run a daily background cron job that evaluates a sample of cache hits, checking if the query and the cached response are logically aligned. If the system detects a mismatch, it alerts our security operations team and invalidates the affected keys.
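The second guardrail can be sketched as a small pre-embedding cleaner. The patterns below are illustrative assumptions, not the list we run in production, and a regex filter like this reduces noise in the embedding rather than providing a complete defense against prompt injection.

```python
import re

# Illustrative patterns only; a real deployment would maintain its own list.
INJECTION_PATTERNS = [
    re.compile(r"(ignore|disregard) (all )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]

# A letter or punctuation mark repeated 3+ times in a row. Digits are left
# alone so that order numbers such as 1000 survive.
REPEATED_CHAR = re.compile(r"([^\W\d_]|[^\w\s])\1{2,}")

# Anything that is not a word character, whitespace, or basic punctuation.
DISALLOWED_CHARS = re.compile(r"[^\w\s?.,'-]")

def sanitize_query(text: str, max_length: int = 500) -> str:
    """Normalize a user query before it is sent to the embedding model."""
    cleaned = text[:max_length]
    for pattern in INJECTION_PATTERNS:
        cleaned = pattern.sub(" ", cleaned)
    cleaned = REPEATED_CHAR.sub(r"\1", cleaned)    # "???" -> "?"
    cleaned = DISALLOWED_CHARS.sub(" ", cleaned)   # drop stray symbols
    cleaned = " ".join(cleaned.split())            # collapse whitespace
    return cleaned.strip(" .,-")

print(sanitize_query("Wheeere is my order???   <<<>>>"))
print(sanitize_query("Ignore previous instructions. Track my package #1000!!!"))
```

Keeping the original, unsanitized query for the actual LLM call (behind its own safeguards) and using the cleaned text only for the cache lookup is one reasonable design choice.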

7. The Results: Instant Responses and Sustained Margins

Implementing the Redis HNSW semantic caching layer made a substantial difference for our client. Their average cache hit rate reached 42% within the first two weeks, meaning roughly two in five incoming support queries were answered without querying OpenAI. For those cached requests, average response latency dropped from 2.2 seconds to 15 milliseconds, which noticeably improved user satisfaction.

More importantly, the reduction in token usage translated to a roughly 80% reduction in their monthly OpenAI API costs, bringing their monthly bill down from $4,500 to less than $900. By combining embedding models with optimized vector graph searches in Redis, we built a secure, high-performance, and cost-efficient caching layer that allows their AI products to scale sustainably.
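As a rough sanity check on what these figures imply for latency across all requests, not just cached ones, the blended average can be computed directly. This is a simplification that uses only the numbers quoted above and ignores the embedding and lookup overhead added to cache misses; `blended_latency_ms` is a hypothetical helper name.

```python
def blended_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average latency when a share of requests is served from cache."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# 42% of requests at 15 ms, the remaining 58% at 2.2 s.
average = blended_latency_ms(hit_rate=0.42, hit_ms=15, miss_ms=2200)
print(f"{average:.1f} ms")
```

In other words, the 15 ms figure applies to cache hits only; the average over all traffic improves in proportion to the hit rate.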

8. Frequently Asked Questions (FAQ)

Q: Does semantic caching work if the user's query contains typos?
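A: In most cases, yes. Modern embedding models are generally robust to minor spelling mistakes and typos, mapping the misspelled query to nearly the same vector representation as the correct spelling, so it usually still clears the similarity threshold. Heavily garbled queries can drift far enough to miss the cache, in which case they simply fall through to the LLM.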

Q: How do we choose the right embedding model for our cache?
A: For local hosting, BGE-Micro or MiniLM-L6 are excellent, lightweight options (under 100MB) that run in sub-10ms on CPU. If you are already using OpenAI APIs, using their text-embedding-3-small model is highly convenient, though it introduces a network roundtrip.

Q: What is the maximum number of vectors we can search in Redis?
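A: Redis HNSW can search millions of vectors with very low latency (typically around a millisecond or a few milliseconds, depending on hardware and index parameters), provided the host machine has enough RAM. Each 384-dimensional FLOAT32 vector consumes about 1.5KB of memory (384 × 4 bytes). An index of 100,000 vectors therefore needs roughly 150MB for the raw vectors, plus overhead for the HNSW graph links and the stored query and response text, which keeps it lightweight.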
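
The memory figure can be checked with a couple of lines of arithmetic. This estimate covers the raw vector data only, assuming 4-byte float32 values; HNSW graph links and the stored query and response text add to it.

```python
def raw_vector_memory_mb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Memory used by the vector values alone, in megabytes (10^6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1_000_000

print(384 * 4)                                # bytes per 384-dim float32 vector
print(raw_vector_memory_mb(100_000, 384))     # MB for 100,000 such vectors
```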