Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Instead of relying solely on a standalone large language model (LLM), RAG first retrieves relevant data from a knowledge base and provides that context to the LLM. This approach makes responses more accurate and grounded in up-to-date information without requiring model retraining.
Cortex Search: a fully managed hybrid search service that powers the retrieval component of RAG applications.
Cortex LLM Functions: SQL and Python functions (such as COMPLETE) that provide access to industry-leading LLMs for the generation component.
-- Create database and warehouse
CREATE DATABASE IF NOT EXISTS cortex_search_db;
CREATE OR REPLACE WAREHOUSE cortex_search_wh WITH
    WAREHOUSE_SIZE = 'X-SMALL';
CREATE OR REPLACE SCHEMA cortex_search_db.services;
USE SCHEMA cortex_search_db.services;
-- Create table for documents
CREATE OR REPLACE TABLE support_transcripts (
    transcript_text VARCHAR,
    region VARCHAR,
    agent_id VARCHAR
);
-- Insert sample data
INSERT INTO support_transcripts VALUES
    ('My internet has been down since yesterday, can you help?', 'North America', 'AG1001'),
    ('I was overcharged for my last bill, need an explanation.', 'Europe', 'AG1002'),
    ('How do I reset my password? The email link is not working.', 'Asia', 'AG1003'),
    ('I received a faulty router, can I get it replaced?', 'North America', 'AG1004');
-- Create Cortex Search Service for RAG retrieval
CREATE OR REPLACE CORTEX SEARCH SERVICE transcript_search_service
    ON transcript_text
    ATTRIBUTES region
    WAREHOUSE = cortex_search_wh
    TARGET_LAG = '1 day'
    EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
    AS (
        SELECT
            transcript_text,
            region,
            agent_id
        FROM support_transcripts
    );
-- Preview search results
SELECT PARSE_JSON(
    SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
        'cortex_search_db.services.transcript_search_service',
        '{
            "query": "internet issues",
            "columns": ["transcript_text", "region"],
            "filter": {"@eq": {"region": "North America"}},
            "limit": 5
        }'
    )
)['results'] AS results;
-- Use the COMPLETE function for RAG response generation.
-- Note: retrieved_context and user_question are placeholders here; in
-- practice they come from your retrieval step and application input.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'Based on the following context, answer the user question.\n\n',
        'Context: ', retrieved_context, '\n\n',
        'Question: ', user_question, '\n\n',
        'Answer:'
    )
) AS response;
-- Full RAG pipeline combining retrieval and generation
WITH retrieved_docs AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'cortex_search_db.services.transcript_search_service',
            '{
                "query": "billing overcharge",
                "columns": ["transcript_text", "region"],
                "limit": 3
            }'
        )
    )['results'] AS results
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'You are a helpful customer support assistant. ',
        'Based on the following support transcripts, provide a helpful response.\n\n',
        'Relevant transcripts: ', results::VARCHAR, '\n\n',
        'Customer question: How do I dispute a billing charge?\n\n',
        'Response:'
    )
) AS rag_response
FROM retrieved_docs;
from snowflake.core import Root
from snowflake.snowpark import Session
from snowflake.cortex import Complete
# Create session
CONNECTION_PARAMETERS = {
    "account": "your_account",
    "user": "your_user",
    "password": "your_password",
    "warehouse": "cortex_search_wh",
    "database": "cortex_search_db",
    "schema": "services",
}
session = Session.builder.configs(CONNECTION_PARAMETERS).create()
root = Root(session)
# Get the search service
search_service = (
    root.databases["cortex_search_db"]
    .schemas["services"]
    .cortex_search_services["transcript_search_service"]
)
# Retrieval: Search for relevant documents
user_query = "internet connection problems"
search_response = search_service.search(
    query=user_query,
    columns=["transcript_text", "region"],
    filter={"@eq": {"region": "North America"}},
    limit=3,
)
# Get retrieved context
retrieved_context = search_response.to_json()
# Generation: Use LLM with retrieved context
prompt = f"""Based on the following customer support transcripts,
provide a helpful response to the user's question.
Retrieved transcripts: {retrieved_context}
User question: {user_query}
Response:"""
response = Complete(
    model="mistral-large",
    prompt=prompt,
    session=session,
)
print(response)
Cortex Search is a managed hybrid (vector + keyword) retrieval service inside Snowflake. You point it at a column in a table, pick an embedding model, and Snowflake handles chunking, embedding, indexing, and incremental refresh as the underlying table changes. Querying is a single SQL function call. A DIY pipeline using EMBED_TEXT_768() plus a manual VECTOR column gives you more control (custom chunkers, for example built on the SPLIT_TEXT_RECURSIVE_CHARACTER function, plus multiple embedding models and custom retrieval logic), but you own the refresh, the indexing strategy, and the BM25 fusion. Cortex Search wins on operational simplicity; DIY wins on flexibility.
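As a rough illustration of the DIY side, a minimal retrieval query might look like the following sketch. The doc_chunks table and its columns are hypothetical, and the model name is one example choice; it assumes chunk embeddings were precomputed into a VECTOR(FLOAT, 768) column.

```sql
-- Hypothetical DIY retrieval: embed the query at run time and rank
-- precomputed chunk embeddings by cosine similarity.
SELECT
    chunk_text,
    VECTOR_COSINE_SIMILARITY(
        embedding,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'internet issues')
    ) AS score
FROM doc_chunks  -- assumed: doc_chunks(chunk_text VARCHAR, embedding VECTOR(FLOAT, 768))
ORDER BY score DESC
LIMIT 5;
```

Note this is vector-only retrieval; reproducing Cortex Search's hybrid behavior would also require a keyword/BM25 leg and score fusion, which is exactly the work the managed service absorbs.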
There are two chunking patterns: let Cortex Search handle it (you set CHUNK_SIZE and CHUNK_OVERLAP in the service definition), or pre-chunk in SQL/Snowpark and store one row per chunk. Pre-chunking is preferable when you have layout-aware splitting requirements — e.g., never split inside a code block, keep table rows together, or use semantic chunking from a Python UDF. Store doc_id, chunk_index, chunk_text, and any metadata needed for filtering in one table; Cortex Search indexes the text column and returns matching rows with metadata for citation.
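The pre-chunking pattern can be sketched in plain Python. This is a simplified recursive character splitter for illustration only (SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER offers a managed SQL equivalent); the size and overlap values are arbitrary.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most chunk_size characters,
    preferring paragraph, then sentence, then word boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back off to the nearest natural boundary inside the window.
            for sep in ("\n\n", ". ", " "):
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks

# One row per chunk, with metadata for filtering and citation.
doc = "First paragraph about billing.\n\nSecond paragraph about routers. " * 5
rows = [
    {"doc_id": "doc-1", "chunk_index": i, "chunk_text": c}
    for i, c in enumerate(chunk_text(doc))
]
```

Loading rows like these into the table behind the search service gives you full control over where chunk boundaries fall, at the cost of owning the splitting logic.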
Retrieval returns the chunk text plus the metadata columns you indexed (doc_id, source URL, page number, last_modified). Pass these to CORTEX.COMPLETE() in the prompt as a numbered list, and instruct the model to emit answers with bracketed citations like [1] referring to the list. In the application, render each citation as a link back to the source. For higher fidelity, return the chunk offsets so the UI can highlight the exact span. The whole pipeline stays auditable — you can replay any answer back to the exact rows that produced it.
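The prompt-assembly and citation-rendering steps described above can be sketched in pure Python. The row shape mirrors what a retrieval call returns, but the field names and URLs here are illustrative stand-ins.

```python
import re

def build_context(rows):
    """Format retrieved chunks as a numbered list for the prompt."""
    return "\n".join(
        f"[{i}] {r['chunk_text']} (source: {r['url']})"
        for i, r in enumerate(rows, start=1)
    )

def render_citations(answer, rows):
    """Replace bracketed citations like [1] with markdown links to their sources."""
    def link(m):
        i = int(m.group(1))
        if 1 <= i <= len(rows):
            return f"[{i}]({rows[i - 1]['url']})"
        return m.group(0)  # leave out-of-range citations untouched
    return re.sub(r"\[(\d+)\]", link, answer)

rows = [
    {"chunk_text": "Refunds are issued within 5 days.", "url": "https://example.com/billing#p2"},
    {"chunk_text": "Routers carry a 1-year warranty.", "url": "https://example.com/hw#p7"},
]
context = build_context(rows)          # goes into the COMPLETE() prompt
answer = "Disputed charges are refunded within 5 days [1]."  # model output
rendered = render_citations(answer, rows)
```

Instructing the model to cite only from the numbered list, and refusing uncited claims in post-processing, is what keeps the pipeline auditable end to end.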
Use Snowflake's native row access policies on the source table — the same policy that gates raw row access also gates which chunks a user can retrieve, because Cortex Search honors the caller's role. Document-level ACLs become a column (allowed_roles ARRAY) plus a policy that filters rows where the current role is not in the array. This is a major advantage over external vector stores: governance lives with the data, not in app code, so a misconfigured app cannot leak documents the user shouldn't see.
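A sketch of the allowed_roles pattern on the example table from earlier. The policy name and the exact ACL scheme are illustrative, not part of the original setup.

```sql
-- Hypothetical document-level ACL: each row lists the roles allowed to see it.
ALTER TABLE support_transcripts ADD COLUMN allowed_roles ARRAY;

CREATE OR REPLACE ROW ACCESS POLICY transcript_acl
    AS (allowed_roles ARRAY) RETURNS BOOLEAN ->
    ARRAY_CONTAINS(CURRENT_ROLE()::VARIANT, allowed_roles);

ALTER TABLE support_transcripts
    ADD ROW ACCESS POLICY transcript_acl ON (allowed_roles);
```

Because Cortex Search evaluates queries under the caller's role, rows filtered by this policy never appear in that caller's retrieval results, and the application layer needs no ACL logic of its own.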
Snowflake exposes several embedding models through EMBED_TEXT_768 and EMBED_TEXT_1024 — common choices include snowflake-arctic-embed-m, e5-base-v2, and nv-embed-qa-4. For most English RAG, start with snowflake-arctic-embed-m — strong MTEB performance and tightly integrated. For multilingual corpora pick a multilingual variant. Run a small evaluation on your domain (recall@5 on a labeled query/relevant-chunk set) before committing — the right model is workload-dependent and the differences are not subtle.
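The recall@5 evaluation can be as small as the following pure-Python sketch; the labeled set and retrieved rankings are stand-ins for your own data.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k retrieved chunk ids contain at
    least one chunk id labeled relevant for that query (hit rate@k)."""
    hits = 0
    for query, ranking in retrieved.items():
        if set(ranking[:k]) & set(relevant.get(query, [])):
            hits += 1
    return hits / len(retrieved) if retrieved else 0.0

# query -> ranked chunk ids returned by one candidate embedding model
retrieved = {
    "reset password": ["c7", "c2", "c9", "c1", "c4"],
    "billing dispute": ["c3", "c8", "c5", "c6", "c2"],
}
# query -> chunk ids a human labeled as relevant
relevant = {
    "reset password": ["c2"],
    "billing dispute": ["c1"],
}
score = recall_at_k(retrieved, relevant, k=5)
```

Run the same labeled queries through each candidate model's index and compare scores; even 50–100 labeled queries usually separate the candidates clearly.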
Partition the corpus into multiple Cortex Search services along a natural axis (region, business unit, document type) so each service stays under the per-service row limit and refreshes independently. Use a thin Streamlit-in-Snowflake or external app that fans out queries to the relevant services and merges results. Monitor via SNOWFLAKE.ACCOUNT_USAGE.CORTEX_SEARCH_SERVING_USAGE_HISTORY and CORTEX_FUNCTIONS_USAGE_HISTORY for cost; track recall offline with a labeled eval set; alarm on refresh-job lag (last refreshed_on timestamp older than your SLA). For freshness-critical content, drive refresh from a stream + task on the source table rather than the default schedule.
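The fan-out-and-merge step might look like the following pure-Python sketch. The service names, result shape, and score-based merge are assumptions — in a real app each query_service call would go through the snowflake.core search API, and scores may need normalizing before they are comparable across services.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_results(per_service_results, limit=5):
    """Merge ranked result lists from several services into one list,
    ordered by descending score and deduplicated by doc id."""
    seen, merged = set(), []
    all_rows = [r for rows in per_service_results for r in rows]
    for row in sorted(all_rows, key=lambda r: r["score"], reverse=True):
        if row["doc_id"] not in seen:
            seen.add(row["doc_id"])
            merged.append(row)
    return merged[:limit]

def query_service(name, query):
    # Stand-in for search_service.search(); returns (doc_id, score) rows.
    fake = {
        "search_emea": [{"doc_id": "e1", "score": 0.91}, {"doc_id": "e2", "score": 0.55}],
        "search_amer": [{"doc_id": "a1", "score": 0.87}, {"doc_id": "e1", "score": 0.80}],
    }
    return fake[name]

# Fan out to the per-region services in parallel, then merge.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda s: query_service(s, "billing"),
                            ["search_emea", "search_amer"]))
top = merge_results(results, limit=3)
```

Routing (querying only the services a request can match, e.g. by region) cuts both latency and serving cost compared with always fanning out to every partition.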