Multimodal RAG Assistant
Chat with your PDFs, Images, and Videos in Real-Time.
Key Features
- Multimodal Ingestion: Extracts searchable text from PDFs, images (Tesseract OCR + Salesforce BLIP captioning), and video (OpenAI Whisper transcription + OpenCV frame extraction).
- Visual & Audio Intelligence: Generates video timelines by combining audio transcription with automated frame-by-frame visual captioning.
- Memory-Aware Chat: Maintains session-level conversational memory via LangChain to preserve context across complex, multi-turn analytical queries.
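
The video timeline idea above can be sketched as a merge of two timestamped streams. This is a minimal illustration, not the project's actual code: `TimelineEvent`, `build_timeline`, and `format_timeline` are hypothetical names, the transcript segments are assumed to be dicts with `start` and `text` keys, and the frame captions are assumed to arrive as `(timestamp_seconds, caption)` pairs.

```python
from dataclasses import dataclass


@dataclass
class TimelineEvent:
    start: float  # seconds from the start of the video
    source: str   # "audio" or "visual"
    text: str


def build_timeline(transcript_segments, frame_captions):
    """Merge speech segments and frame captions into one time-ordered list.

    transcript_segments: iterable of dicts with "start" and "text" keys
        (assumed shape for speech-to-text output).
    frame_captions: iterable of (timestamp_seconds, caption) tuples
        (assumed shape for captions of sampled frames).
    """
    events = [
        TimelineEvent(float(seg["start"]), "audio", seg["text"].strip())
        for seg in transcript_segments
    ]
    events += [
        TimelineEvent(float(ts), "visual", caption.strip())
        for ts, caption in frame_captions
    ]
    # sorted() is stable, so audio events stay ahead of visual ones on ties.
    return sorted(events, key=lambda e: e.start)


def format_timeline(events):
    """Render events as '[MM:SS] (source) text' lines, ready for indexing."""
    lines = []
    for e in events:
        minutes, seconds = divmod(int(e.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] ({e.source}) {e.text}")
    return "\n".join(lines)
```

Keeping the source label on each line lets a retriever distinguish what was said from what was shown when answering a question about a specific moment.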
Impact & Results
- Proven Accuracy: Validated the retrieval pipeline with the Ragas framework, scoring 0.98 Context Precision and 0.96 Answer Relevancy.
- Optimized Cloud Deployment: Deployed on Hugging Face Spaces, resolving Streamlit out-of-memory (OOM) errors by managing Linux dependencies within a 16GB RAM container.
- Unlocked "Dark Data": Transformed opaque multimedia assets into instantly searchable knowledge, grounding answers in retrieved content to reduce LLM hallucinations.
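
For readers unfamiliar with the Context Precision metric cited above, the sketch below shows the rank-weighted formula it is based on: the mean of precision@k taken over the ranks where a relevant chunk appears. It is a hand-rolled illustration, not the Ragas implementation. In Ragas the relevance judgments come from an evaluator, whereas here they are supplied directly as 0/1 flags, and `context_precision` is a hypothetical helper name.

```python
def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    relevance: list of 0/1 flags, one per retrieved chunk, in rank order
        (hand-supplied here as an assumption).
    Returns the mean of precision@k over the ranks k that hold a relevant
    chunk, or 0.0 when no chunk is relevant.
    """
    hits = 0
    weighted = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0
```

A score near 1.0 means relevant chunks are ranked ahead of irrelevant ones. For example, `[1, 1, 0]` scores 1.0, while `[0, 1]` scores 0.5 because the only relevant chunk sits at rank 2.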