Semantic searches and Retrieval Augmented Generation
By Aravind
semantic search
dense retrievals re-ranking
- first obtain the data
- chunk it (split by whitespace, new line etc )
- create embeddings
- store in vector DB (build searchable index)
- search based on similarity
Document search https://github.com/google/generative-ai-docs/blob/main/site/en/gemini-api/tutorials/document_search.ipynb
chroma vector DB example https://github.com/google-gemini/cookbook/blob/main/examples/chromadb/Vectordb_with_chroma.ipynb
Example
Modules to install
!pip install langchain-google-genai
!pip install langchain langchain-core langchain-community langchain-text-splitters langchain-huggingface faiss-cpu sentence-transformers
CONTENT = """
Interstellar is a 2014 epic science fiction film co-written, directed, and pro
duced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin,
Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film
follows a group of astronauts who travel through a wormhole near Saturn in
search of a new home for mankind.
Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its
origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne
was an executive producer, acted as a scientific consultant, and wrote a tie-in
book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision
anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland,
and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company
Double Negative created additional digital effects.
Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues
using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subse
quent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score,
visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy
and portrayal of theoretical astrophysics. Since its premiere, Interstellar
gained a cult following,[5] and now is regarded by many sci-fi experts as one
of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning
Best Visual Effects, and received numerous other accolades
"""
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_text(CONTENT)
len(all_splits)
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector_1 = embeddings.embed_query(all_splits[0])
vector_2 = embeddings.embed_query(all_splits[1])
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)
# convert to document if its only text chunking . if already documents use _add_documents
from langchain_core.documents import Document
documents = [
Document(
page_content=text,
metadata={
"chunk_id": i,
"source": "interstellar_info",
"chunk_size": len(text)
}
)
for i, text in enumerate(all_splits)
]
ids = vector_store.add_documents(documents)
results = vector_store.similarity_search(
"who are the cast ?"
)
print(results[0])
page_content=’Interstellar is a 2014 epic science fiction film co-written, directed, and pro duced by Christopher Nolan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind. Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007. Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.’ metadata={‘chunk_id’: 0, ‘source’: ‘interstellar_info’, ‘chunk_size’: 998}
Performing Similarity Search
Vector DBs are used to store embeddings and when one wants to perform similarity search, the search will be against the vector space using a variety of methods such as cosine similarity. Vector DBs such as chroma DB can be used. But below examples are based on in memory vector stores such as FAISS
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import numpy as np
# Setup (assuming this is already done)
CONTENT = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.
Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.
Interstellar premiered on October 26, 2014, in Los Angeles.
The film had a worldwide gross over $677 million, making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
"""
# Create text splits and vector store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
all_splits = text_splitter.split_text(CONTENT)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Convert to documents
documents = [
Document(page_content=text, metadata={"chunk_id": i})
for i, text in enumerate(all_splits)
]
# Create FAISS vector store
vector_store = FAISS.from_documents(documents, embeddings)
print(f"Vector store created with {len(documents)} documents")
# ========== NEAREST NEIGHBORS SEARCH METHODS ==========
# Method 1: Using LangChain's built-in methods
# Similarity search (returns documents)
query = "what was the revenue of the movie ?"
results = vector_store.similarity_search(query, k=3)
print(f"Query: '{query}'")
print(f"Top {len(results)} similar documents:")
for i, doc in enumerate(results):
print(f"{i+1}. {doc.page_content[:100]}...")
# Similarity search with scores
results_with_scores = vector_store.similarity_search_with_score(query, k=3)
print(f"\nSimilarity search with scores:")
for i, (doc, score) in enumerate(results_with_scores):
print(f"{i+1}. Score: {score:.4f} | Text: {doc.page_content[:80]}...")
Vector store created with 10 documents
Method 1: LangChain Built-in Methods
Query: ‘what was the revenue of the movie ?’ Top 3 similar documents:
- The film had a worldwide gross over $677 million, making it the tenth-highest grossing film of 2014….
- It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambi…
- It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon…
Similarity search with scores:
-
Score: 0.6788 Text: The film had a worldwide gross over $677 million, making it the tenth-highest gr… -
Score: 1.1418 Text: It received acclaim for its performances, direction, screenplay, musical score, … -
Score: 1.2572 Text: It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen…
Reranking
https://python.langchain.com/docs/integrations/document_transformers/rankllm-reranker/
** References **
- https://www.digitalocean.com/community/conceptual-articles/rag-ai-agents-agentic-rag-comparative-analysis
RAG (Retrival Augmented Generation)**
add a LLM to the search response. Helps reduce hallucinations and no need to fine tune the entire model again. This is how perplexity and others work. search engine + summarizations
## RAG (retrieval + grounded LLM for summarizations)
# Rag prompts
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import numpy as np
from langchain import PromptTemplate
from langchain.chains.retrieval_qa.base import RetrievalQA
# define RAG prompt template
template = """
<|user|> relavent information: {context}
provide consise answer for the following question using relavent information provided above {question} <|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
#import from local vector store . Similarly can query chroma and other vector stores
vectorstore = FAISS.from_documents(documents, embeddings)
# retrieve data
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={
"k": 3,
#"filter": {"source": "specific_document.pdf"}
}
)
rag = RetrievalQA.from_chain_type(
llm=llm,
chain_type='stuff',
retriever=retriever,
chain_type_kwargs={"prompt": prompt})
result = rag({"query": "mention the casts of the movie intersteller?"})
print(result)
[‘result’: ‘The movie Interstellar stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.’}
ai_ml
]
tags: ai_ml