About
Semantic search over all 450+ My First Million (MFM) podcast episodes. It can be used with any YouTube playlist; I chose a podcast for this hackathon because it is something most people are familiar with. You can combine multiple channels, parts of channels, or just an assortment of videos of your choice.
- YouTube Data API v3 - fetches and processes videos from YouTube for the transcript backend that powers semantic search.
- Milvus.io / Zilliz - vector DB backend that stores video transcript embeddings and powers semantic search for the frontend.
- OpenAI's text-embedding-ada-002 - generates the embeddings stored in the vector DB, giving the client richer queries than basic keyword search. Read more on the k-nearest-neighbor (KNN) algorithm.
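The KNN search the vector DB performs can be illustrated with a small pure-Python sketch. This is illustrative only: the real service queries Milvus with ada-002 embeddings, and the tiny 3-dimension vectors and chunk IDs below are made up.

```python
import math

def cosine_similarity(a, b):
    # Similarity between two embedding vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn_search(query, corpus, k=2):
    # Rank stored (chunk_id, vector) pairs by similarity to the query vector
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings"; real ada-002 vectors have 1,536 dimensions
corpus = [
    ("ep101-chunk3", [0.9, 0.1, 0.0]),
    ("ep245-chunk7", [0.0, 1.0, 0.2]),
    ("ep377-chunk1", [0.8, 0.2, 0.1]),
]
print(knn_search([1.0, 0.0, 0.0], corpus, k=2))
```

Milvus does the same ranking at scale with approximate nearest-neighbor indexes instead of this brute-force scan.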
Videos are transcribed using some hacky Python scripts, combined with associated metadata, and pre-processed (cleaned). The transcripts are chunked by token count and converted to 1,536-dimension text embeddings, which are stored in the vector database. There are limitations; for those who care more about this topic, read the Milvus documentation.
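The chunking step can be sketched as follows. This is a simplified word-based version with overlapping windows (the real pipeline chunks by tokens, so the exact boundaries differ); the chunk sizes are illustrative defaults, not the project's actual settings.

```python
def chunk_transcript(text, chunk_size=200, overlap=40):
    # Split a transcript into overlapping word windows so sentences that
    # straddle a chunk boundary still appear intact in at least one chunk.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each resulting chunk is then embedded individually, so a search hit points at a specific passage of an episode rather than the whole transcript.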
Next Steps & Feedback
Some of my plans to improve this project:
- Moving away from the YouTube V3 API toward a faster transcription solution. Whisper is good but expensive; pytube and other Python packages will probably be used once the amount of video content exceeds a certain storage capacity.
- Adding visual elements to the search experience (e.g. thumbnail generation specific to the exact timestamp) using Puppeteer or some other solution.
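Even before thumbnails land, each search result can deep-link into the matched moment using YouTube's standard `t` URL parameter. A small helper sketch (the thumbnail rendering itself, e.g. via Puppeteer, is not shown; the video ID below is a placeholder):

```python
def timestamped_url(video_id, seconds):
    # Build a YouTube deep link that starts playback at the matched chunk.
    # YouTube accepts whole seconds, so fractional offsets are truncated.
    return f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"

print(timestamped_url("VIDEO_ID", 754.3))
# -> https://www.youtube.com/watch?v=VIDEO_ID&t=754s
```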