MFM Podcast Chat

AI-powered Semantic search over all 450+ My First Million (MFM) podcast episodes 💰

About

Semantic search over all 450+ My First Million (MFM) podcast episodes. The same pipeline works for any YouTube playlist: you can combine multiple channels, parts of channels, or just an assortment of videos of your choice. I chose a podcast for this hackathon because it is something most people are familiar with.

YouTube V3 API - fetches and processes videos from YouTube; their transcripts are the data behind the semantic search.
Milvus.io / Zilliz - vector DB backend storing video transcript data and powering semantic search for the frontend.
OpenAI's text-embedding-ada-002 - embedding model used in conjunction with the vector DB. Lets the client search by meaning instead of basic keyword matching. Read more on the k-nearest-neighbor (KNN) algorithm.
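To make the KNN idea concrete, here is a minimal brute-force sketch in plain Python. It is not how this project runs its search (that is delegated to Milvus / Zilliz); the toy 3-dimensional vectors, the chunk IDs, and the function names `cosine_similarity` and `knn_search` are all assumptions made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query, corpus, k=2):
    """Return the IDs of the k corpus vectors most similar to the query.

    Brute force: scores every vector. A vector DB typically uses an
    index so it does not have to scan the whole collection.
    """
    scored = [(cosine_similarity(query, vec), chunk_id)
              for chunk_id, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical toy data: chunk ID -> embedding (real embeddings are much longer).
corpus = {
    "ep1@00:12:30": [0.9, 0.1, 0.0],
    "ep2@00:45:10": [0.1, 0.9, 0.0],
    "ep3@01:02:05": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

top = knn_search(query, corpus, k=2)
print(top)
```

In the real pipeline the query vector would come from embedding the user's search text with the same model that embedded the transcripts, so that both live in the same vector space.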

Videos are transcribed using some hacky Python scripts, combined with associated metadata, and pre-processed (cleaned). The transcripts are chunked by token count, converted to text embeddings (text-embedding-ada-002 produces 1,536-dimensional vectors), and stored in the vector database. There are limitations; for those who want to dig deeper into this topic, read the Milvus documentation.
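The chunking step described above could look roughly like the sketch below. It is an assumption-heavy illustration, not the project's actual script: whitespace splitting stands in for a real tokenizer, the overlap between chunks is my own addition, and the names `chunk_tokens`, `build_records`, and the `"VIDEO_ID"` placeholder are hypothetical.

```python
def chunk_tokens(tokens, max_tokens=200, overlap=20):
    """Split a token list into windows of at most max_tokens.

    Consecutive windows share `overlap` tokens so a sentence cut at a
    boundary still appears whole in one of the chunks.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the transcript
    return chunks


def build_records(video_id, transcript, max_tokens=200, overlap=20):
    """Attach metadata to each chunk; each record would then be embedded and stored."""
    tokens = transcript.split()  # assumption: whitespace split, not a real tokenizer
    return [
        {"video_id": video_id, "chunk_index": i, "text": " ".join(chunk)}
        for i, chunk in enumerate(chunk_tokens(tokens, max_tokens, overlap))
    ]


transcript = "one two three four five six seven eight nine ten"
records = build_records("VIDEO_ID", transcript, max_tokens=4, overlap=1)
for record in records:
    print(record)
```

Each record's `text` field is what would be sent to the embedding model, with the remaining fields kept as metadata alongside the vector.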

Next Steps & Feedback

Some of my plans to improve this project:

Moving away from the YouTube V3 API towards a faster transcribing solution. Whisper is good but expensive, so pytube and other Python packages will probably be used once the amount of video content exceeds a certain storage capacity.
Adding visual elements to the search experience (e.g. thumbnail generation specific to the exact timestamp) using Puppeteer or some other solution.