diff --git a/src/oss/python/integrations/vectorstores/teradata.mdx b/src/oss/python/integrations/vectorstores/teradata.mdx new file mode 100644 index 0000000000..4c91a781dc --- /dev/null +++ b/src/oss/python/integrations/vectorstores/teradata.mdx @@ -0,0 +1,385 @@ +--- +title: TeradataVectorStore +--- + +>Teradata Vector Store is designed to store, index, and search high-dimensional vector embeddings efficiently within your enterprise data platform. + +This guide shows you how to quickly get up and running with TeradataVectorStore for your semantic search and RAG applications. Whether you're new to Teradata or looking to add AI capabilities to your existing data workflows, this guide walks you through setup, querying, and a basic RAG pipeline. + +**What makes TeradataVectorStore special?** +- Built on the enterprise-grade Teradata Vantage platform. +- Seamlessly integrates with your existing data warehouse. +- Supports multiple vector search algorithms for different use cases. +- Scales from prototype to production workloads. + +## Setup + +Before we dive in, you'll need to install the necessary packages. TeradataVectorStore is part of the `langchain-teradata` package, which also includes other Teradata integrations for LangChain. + +**New to Teradata?** Refer to: +- [Teradata VantageCloud Lake](https://www.teradata.com/platform/vantagecloud) +- Get started with [VantageCloud Lake](https://docs.teradata.com/r/Lake-Getting-Started-with-VantageCloud-Lake/) + +### Installation + + + ```bash pip + pip install langchain-teradata + ``` + +### Credentials + +**Connecting to Teradata:** The `create_context()` function establishes your connection to the Teradata Vantage system. This is how teradataml (and by extension, TeradataVectorStore) knows which system to connect to and how to authenticate. 
+ +**What you'll need:** +- **hostname**: Your Teradata system's address +- **username/password**: Your database credentials +- **base_url**: API endpoint for your Teradata system +- **pat_token**: Personal Access Token for API authentication +- **pem_file**: SSL certificate file for secure connections + +**For more information** Check out the [Teradata Vector Store User Guide](https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Vector-Store-User-Guide/Setting-up-Vector-Store/Required-Privileges) for detailed setup instructions. + +**For information related to teradataml** Refer to [TeradataML User Guide](https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide/Introduction-to-Teradata-Package-for-Python) + +```python +import os +from getpass import getpass +from teradataml import create_context + +os.environ['TD_HOST'] = getpass(prompt='hostname: ') +os.environ['TD_USERNAME'] = getpass(prompt='username: ') +os.environ['TD_PASSWORD'] = getpass(prompt='password: ') +os.environ['TD_BASE_URL'] = getpass(prompt='base_url: ') +os.environ['TD_PAT_TOKEN'] = getpass(prompt='pat_token: ') +os.environ['TD_PEM_FILE'] = getpass(prompt='pem_file: ') +create_context() +``` + +--- + +## Instantiation + +**Initialize your embeddings** + +**TeradataVectorStore supports three types of embedding objects:** +1. **String identifiers** (e.g., "amazon.titan-embed-text-v1") +2. **TeradataAI objects** +3. **LangChain embedding objects** - LangChain-compatible embedding model objects + +```python +# Initialize embeddings +from langchain_aws import BedrockEmbeddings +embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name="us-west-2") +``` + +**Create Your First Vector Store** + +Let's start with some sample Documents and create a vector store. The `from_documents()` method is one of the most straightforward ways to get started - just pass in your documents and TeradataVectorStore handles the rest. 
+ +**What happens under the hood:** +- Your documents are converted to a teradataml DataFrame and passed to the vector store. +- Embeddings are generated and stored for each Document object. +- Indexes are automatically created for fast similarity search and chat operations. + +```python +from langchain_teradata import TeradataVectorStore +from langchain_core.documents import Document +# Sample documents about different topics +docs = [ + Document(page_content="Teradata provides scalable data analytics solutions for enterprises."), + Document(page_content="Machine learning models require high-quality training data to perform well."), + Document(page_content="Vector databases enable semantic search capabilities beyond keyword matching."), + Document(page_content="LangChain simplifies building applications with large language models."), + Document(page_content="Data warehousing has evolved to support real-time analytics and AI workloads.") +] + +# Create the vector store +vs = TeradataVectorStore.from_documents( + name="my_knowledge_base", + documents=docs, + embedding=embeddings +) + +print("Vector store created successfully!") +``` + +After creating your vector store, it's good practice to verify that everything was set up correctly. TeradataVectorStore provides methods to monitor your operations and understand what's happening behind the scenes. + +**Why check status?** +- **Operation tracking**: See exactly which stage your vector store creation is at. +- **Troubleshooting**: Quickly identify if something went wrong during setup. +- **Progress monitoring**: For large datasets, track embedding generation progress. +- **Validation**: Confirm your vector store is ready for queries. + +```python +# Check the status of the store. +vs.status() +``` + +Want to see what's actually inside your vector store? The `get_details()` method gives you a comprehensive overview of your setup - think of it as your vector store's "dashboard." 
+ +**What you'll see:** +- **Object inventory**: Number of tables or documents you have added. +- **Search parameters**: Current algorithm settings (HNSW, K-means, etc.) +- **Configuration details**: Embedding dimensions, distance metrics, and indexing options. +- **Performance settings**: Top-k values, similarity thresholds, and other query parameters. + +```python +vs.get_details() +``` + +--- + +## Manage vector store + +### Add items to vector store + +One of the best features of TeradataVectorStore is how easy it is to expand your knowledge base. As your business grows and you have more documents, you can continuously add them without rebuilding everything from scratch. + +**Real-world scenarios:** +- Add new product documentation as it's created. +- Include fresh research papers or industry reports. +- Incorporate customer feedback and support documents. +- Update with latest policy or procedure changes. + +**Enterprise advantage:** Since everything runs on Teradata, you can easily add data from your existing tables, data warehouses, or real-time feeds without complex data movement. + +```python +# Add more documents +additional_docs = [ + Document(page_content="Retrieval-augmented generation combines the power of search with language models."), + Document(page_content="Teradata's vector capabilities support both structured and unstructured data analysis.") +] + +vs.add_documents(documents=additional_docs) +print("Added more knowledge to the vector store!") +``` + +```python +# Check the status of the new store. +vs.status() +``` + +--- + +## Query vector store + +Once your vector store has been created and the relevant documents have been added, you will most likely wish to query it during the running of your chain or agent. + +### Query directly + +Now let's search for information in our vector store. Unlike traditional keyword search, vector search understands the meaning behind your questions. 
Ask about "AI applications" and it might return results about "machine learning models" because it understands these concepts are related. + +**How similarity search works:** +- Your question gets converted to a vector embedding (just like your documents). +- TeradataVectorStore calculates similarity scores between your question and stored documents. +- The most relevant results are returned, ranked by similarity. + +```python +# Ask a question +question = "What are vector databases?" +results = vs.similarity_search(question=question, return_type="json") + +print("Found relevant information:") +for result in results.similar_objects: + print(f" {result}") +``` + +### Query by turning into retriever + +You can also transform the vector store into a retriever for easier usage in your chains. + +```python +# Create a retriever for your RAG pipeline +retriever = vs.as_retriever(search_type="similarity") + +# Test the retriever +retrieved_docs = retriever.invoke("Tell me about Teradata's capabilities") + +print("Retrieved documents for RAG:") +for doc in retrieved_docs: + print(f"- {doc.page_content}") +``` + +--- + +## Usage for retrieval-augmented generation + +The `ask()` method combines vector search with language model generation. Instead of just returning raw document chunks, you get coherent, contextual answers. + +**The two-step process:** +1. **Retrieval**: Find the most relevant documents from your vector store. +2. **Generation**: Use those documents as context to generate a natural language response. + +**Why this is powerful:** Your AI responses are grounded in your actual data, which reduces hallucinations and improves accuracy. It's like having a knowledgeable assistant who actually read your company's documents! 
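To make the retrieval step concrete, here is a minimal sketch of ranking documents by similarity to a question. It is not how TeradataVectorStore is implemented: the `embed`, `cosine`, and `retrieve` helpers are hypothetical, and word-count vectors stand in for real model embeddings (so this toy version only matches shared words, not meaning).

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real embeddings come from a model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=2):
    """Rank documents by similarity to the question and keep the top_k."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


documents = [
    "Teradata provides scalable data analytics solutions for enterprises.",
    "Vector databases enable semantic search capabilities beyond keyword matching.",
    "LangChain simplifies building applications with large language models.",
]
top = retrieve("what do vector databases enable", documents, top_k=1)
print(top[0])
```

The same shape (embed the question, score it against stored vectors, keep the best matches) is what the retrieval step performs at scale, with model embeddings and database-side indexes in place of these toy helpers.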
+ +```python +# Get a comprehensive answer +response = vs.ask(question="What are the benefits of using vector databases?") +print("AI Response:") +print(response) +``` + +Retrieval-Augmented Generation (RAG) is the technique that powers most modern AI assistants and chatbots. TeradataVectorStore integrates seamlessly with LangChain to make building RAG applications straightforward. + +**What makes a good RAG application:** +- **Relevant retrieval**: Your vector store finds the right information. +- **Contextual generation**: The language model uses that information effectively. +- **Source transparency**: Users can see where answers come from. + + +**How it works with TeradataVectorStore**: +- You can use your vector store as a retriever to get the most relevant documents, then pass those documents to a RAG chain within LangChain workflows. +- This gives you the flexibility to build custom pipelines while leveraging Teradata's powerful vector search capabilities. + +Now let's build a complete RAG pipeline that combines your TeradataVectorStore retriever with a language model. This demonstrates the full power of RAG - retrieving relevant information from your vector store and using it to generate informed responses. + +**What's happening in this pipeline:** + +- Retrieval: Your vector store finds the most relevant documents for the question. +- Context preparation: Those documents become context for the language model. +- Generation: The LM generates an answer based on your actual data. +- Output parsing: Clean, formatted response ready for your application. + + +**Real-world applications:** + +- Customer support: Answer questions using your product documentation. +- Research assistance: Query your organization's knowledge repositories. +- Compliance: Ensure responses are based on approved company information. 
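The context-preparation step can be sketched on its own in plain Python. This is an illustrative sketch, not part of `langchain-teradata`: `Doc` is a hypothetical stand-in for a retrieved document (only a `page_content` attribute is assumed), and `format_docs` and `build_prompt` are hypothetical helper names. The prompt wording mirrors the template used in this guide.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (only `page_content` is used)."""
    page_content: str


def format_docs(docs):
    """Join the text of each retrieved document into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(context, question):
    """Fill the context/question layout used by the prompt template in this guide."""
    return (
        "Use the following context to answer the question.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


retrieved = [
    Doc("Vector databases enable semantic search capabilities beyond keyword matching."),
    Doc("Teradata provides scalable data analytics solutions for enterprises."),
]
context = format_docs(retrieved)
prompt_text = build_prompt(context, "What are vector databases?")
print(prompt_text)
```

In a LangChain chain, a helper like `format_docs` is commonly piped after the retriever (for example `retriever | format_docs`) so the prompt receives plain text rather than a list of document objects.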
+ +```python +from langchain_core.runnables import RunnablePassthrough +from langchain_core.prompts import PromptTemplate +from langchain_core.output_parsers import StrOutputParser +from langchain.chat_models import init_chat_model + +# Example: simple RAG chain +# Initialize the chat model (fill in your AWS region and credentials) +llm = init_chat_model("anthropic.claude-3-5-sonnet-20240620-v1:0", + model_provider="bedrock_converse", + region_name="", + aws_access_key_id="", + aws_secret_access_key="" + ) + + +# Create a prompt template for the LLM to format its response using retrieved context +prompt = PromptTemplate.from_template( + "Use the following context to answer the question.\nContext:\n{context}\n\nQuestion: {question}\nAnswer:" +) + +# Build the RAG chain: retrieve context, format prompt, generate answer, and parse output +rag_chain = ( + { + "context": retriever, + "question": RunnablePassthrough() + } + | prompt + | llm + | StrOutputParser() +) + +# Invoke the RAG chain with a sample question and print the response +response = rag_chain.invoke("Benefits of Vector Store") +print(response) +``` + +--- + +## Working with Different Data Types + +TeradataVectorStore's flexibility really shines when working with different types of data sources. Depending on what you're starting with, you can choose the most appropriate method. 
+ +**Choose your starting point:** +- **Have PDF documents?** Use `from_documents()` with file paths +- **Working with database tables?** Use `from_datasets()` with DataFrames +- **Already have embeddings?** Use `from_embeddings()` to import them directly + +### From PDF Files +```python +# File-based vector store from PDFs +pdf_vs = TeradataVectorStore.from_documents( + name="pdf_knowledge", + documents="path/to/your/document.pdf", # or list of PDF paths + embedding=embeddings +) +``` + +### From Database Tables +```python +# Content-based from existing tables +from teradataml import DataFrame +table_data = DataFrame('your_table_name') + +table_vs = TeradataVectorStore.from_datasets( + name="table_knowledge", + data=table_data, + data_columns=["text_column"], + embedding=embeddings +) +``` + +### From Pre-computed Embeddings +```python +# If you already have embeddings +embedding_vs = TeradataVectorStore.from_embeddings( + name="embedding_store", + data=your_embedding_data, + data_columns="embedding_column" +) +``` + +***Note***
+When working with tables (including tables of pre-computed embeddings), the `data_columns` parameter is mandatory. This tells TeradataVectorStore exactly which columns contain the text content you want to convert into embeddings. Think of it as pointing the service to the right information. + +For example, if your table has columns like `id`, `title`, `description`, and `category`, you'd specify `data_columns=["description"]` to embed only the description text, or `data_columns=["title", "description"]` to combine both fields. + +Below is a small example of loading a sample table with `teradatagenai` and creating a content-based store from it. For `data_columns`, we pass the `rev_text` column, which is used to generate the embeddings. + +```python +from teradatagenai import load_data + +# Load sample data into Teradata +load_data("byom", "amazon_reviews_25") + +# Create a vector store from the Teradata table +td_vs = TeradataVectorStore.from_datasets( + name="table_store_amazon", + data="amazon_reviews_25", + data_columns="rev_text", + embedding=embeddings) +``` + +```python +# Check the status of the new store +td_vs.status() +``` + +--- + +## Next Steps + +Congratulations! You've just built your first AI-powered search and RAG system with TeradataVectorStore. You're now ready to scale this up to handle real enterprise workloads. 
+ +**Ready to go deeper?** +- **Advanced search algorithms**: Try HNSW or K-means clustering for large-scale deployments +- **Custom embedding models**: Experiment with domain-specific embeddings for your industry +- **Real-time updates**: Set up pipelines to automatically update your vector store as new data arrives + +**Production considerations:** +- **Security**: Leverage Teradata's enterprise security features +- **Monitoring**: Use Teradata's built-in performance monitoring + +**Learn more:** +- [LangChain RAG Tutorials](https://python.langchain.com/docs/tutorials/rag) - Deep dive into RAG patterns +- [TeradataVectorStore Workflows](https://github.com/Teradata/langchain-teradata) - Complete examples and use cases +- [VantageCloud Lake](https://www.teradata.com/platform/vantagecloud) - Cloud-native analytics platform + +--- + +## API reference + +For detailed documentation of all TeradataVectorStore features and configurations, head to the API reference: +[langchain-teradata User Guide](https://docs.teradata.com/search/documents?query=Teradata+Package+for+LangChain&sort=last_update&virtual-field=title_only&content-lang=en-US) \ No newline at end of file