
YouTubeGPT ft. Matt Wolfe (@mreflow)

AI Chatbot with 100+ videos from YouTuber Matt Wolfe

Website Github


Table of Contents

    📝 About
    💻 How to build
    🚀 Next steps
    🔧 Tools used
    👤 Contact

        📝About

        • Chat with 100+ YouTube videos from any creator in less than 10 minutes.
        • Ask questions in natural language and get detailed answers:
          • (1) in the style / tone of your YouTuber
          • (2) with hyperlinks to the top 2-3 videos referenced

        The example used in this repo is tech content creator Matt Wolfe.


        💻 How to build

        Note: these instructions are for macOS; adjust accordingly for Windows / Linux.

        Initial setup

        Clone and install dependencies:

        git clone https://github.com/vdutts7/mreflow-ai
        cd mreflow-ai
        npm i

        Copy .env.example to .env in the root directory and fill in the API keys:

        ASSEMBLY_AI_API_TOKEN=""
        OPENAI_API_KEY=""
        PINECONE_API_KEY=""
        PINECONE_ENVIRONMENT=""
        PINECONE_INDEX=""

        Get API keys from AssemblyAI, OpenAI, and Pinecone.

        IMPORTANT: Verify that .gitignore lists .env, so your keys are never committed.
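
Both checks above can be automated before running any script. The following is a minimal sketch, assuming the variable names from .env.example listed earlier. The helper names `missing_keys` and `gitignore_covers_env` are hypothetical and not part of the repo, and the .gitignore check only recognizes a literal `.env` or `/.env` line, not wildcard patterns such as `.env*`.

```python
import os

# Variable names copied from the .env.example listing above.
REQUIRED_KEYS = [
    "ASSEMBLY_AI_API_TOKEN",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]


def missing_keys(env=None):
    """Return the required keys that are unset or blank in `env`.

    `env` is any mapping of names to values; it defaults to the
    process environment.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def gitignore_covers_env(gitignore_text):
    """Return True if a line of the .gitignore text is `.env` or `/.env`.

    Simplification: wildcard patterns are not interpreted.
    """
    lines = (line.strip() for line in gitignore_text.splitlines())
    return any(line in (".env", "/.env") for line in lines)
```

Calling `missing_keys()` with no argument inspects the current process environment, so the variables need to be exported (or loaded from .env by whatever loader you use) before the check is meaningful.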

        Handle massive data

        Outline:

        • Export metadata (.csv) of YouTube videos ⬇️
        • Download the audio files
        • Transcribe audio files

        Navigate to the scripts folder, which will host all of the data from the YouTube videos.

        cd scripts

        Set up the Python environment:

        conda env list
        conda activate youtube-chat
        pip install -r requirements.txt

        Scrape the YouTube channel. Replace <username> with the handle of your choice (for example, mreflow) and <k-last-vids> with the number of videos you want included; the script traverses backwards starting from the most recent upload. A new file <your-csv-file>.csv will be created at the path given in the command:

        python scripts/scrape_vids.py https://www.youtube.com/@<username> <k-last-vids> scripts/vid_list/<your-csv-file>.csv

        Refer to example.csv inside the folder and verify that your output matches its format.
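
One way to do that comparison without eyeballing it is to compare header rows. This is a rough sketch using only the standard library; the function names are hypothetical, and it checks column names and order only, not the values in each row.

```python
import csv
import io


def header_of(csv_text):
    """Return the first row (the column names) of CSV text, or [] if empty."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def same_columns(output_text, example_text):
    """True if both CSV texts have identical header rows."""
    return header_of(output_text) == header_of(example_text)
```

To use it, read both files as text (for instance with `pathlib.Path(...).read_text()`) and pass your generated CSV as `output_text` and example.csv as `example_text`.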

        Download audio files:

        python scripts/download_yt_audios.py scripts/vid_list/<your-csv-file>.csv scripts/audio_files/

        We will use AssemblyAI's API wrapper class for OpenAI's Whisper API. Their script provides step-by-step directions for faster, more efficient speech-to-text conversion, since using Whisper directly is much slower and costs more. I spent about $3.50 to transcribe the 112 videos for Matt Wolfe.

        python scripts/transcribe_audios.py scripts/audio_files/ scripts/transcripts
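
To budget for a different channel, the figure above (about $3.50 for 112 videos, roughly 3 cents per video) can be scaled. This is a back-of-the-envelope sketch only: it assumes cost grows linearly with the number of videos and that your videos are about as long as the ones used here, and the function name is hypothetical.

```python
def estimate_transcription_cost(num_videos, reference_cost_usd=3.50, reference_videos=112):
    """Scale the reported cost linearly to another number of videos.

    Assumption: your videos are similar in length to the reference set.
    """
    return num_videos * reference_cost_usd / reference_videos
```

Under those assumptions, 200 videos would come to about $6.25. Treat the result as a rough guide and check current pricing before starting a large batch.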

        Upsert to Pinecone database:

        python scripts/pinecone_helper.py scripts/vid_list/<your-csv-file>.csv scripts/transcripts/

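        Pinecone index setup I used: P1, since it is optimized for speed, with 1536 dimensions. 1536 is the vector size of OpenAI's embeddings (for example, text-embedding-ada-002), and the index dimension has to match it for upserts and queries against the vectorstore to work.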

        Embeddings and database backend

        Breaking down scripts/pinecone_helper.py:

        • Chunk size of 1000 characters with a 500-character overlap. This worked for me, but experiment and adjust according to the size and complexity of your content library.
        • Metadata: (1) video url and (2) video title
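
To make the chunking parameters concrete, here is a minimal sketch of a fixed-size character splitter with overlap, plus the metadata attached to each chunk. It is a plain sliding window written for illustration, not necessarily the splitter that scripts/pinecone_helper.py uses, and the function names and the `url` / `title` field names are assumptions.

```python
def chunk_text(text, chunk_size=1000, overlap=500):
    """Split text into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters, so a sentence cut
    at one boundary still appears whole in a neighbouring chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(transcript, video_url, video_title, chunk_size=1000, overlap=500):
    """Pair every chunk with the video metadata (field names are hypothetical)."""
    return [
        {"text": chunk, "metadata": {"url": video_url, "title": video_title}}
        for chunk in chunk_text(transcript, chunk_size, overlap)
    ]
```

With the defaults, each window starts 500 characters after the previous one, so every character past the first 500 lands in two chunks. That roughly doubles the number of embeddings to store in exchange for better context at chunk boundaries.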

        With the Pinecone vectorstore loaded, we use LangChain's Conversational Retrieval QA to take a question, retrieve the most relevant chunks and their metadata from our embeddings, and return a composed answer to the user.
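
The retrieval step is handled by Pinecone and LangChain in the app, but the underlying idea fits in a few lines. The sketch below is an in-memory illustration only: it assumes cosine similarity as the metric, uses tiny 3-dimensional vectors in place of 1536-dimensional embeddings, and the function names and record layout are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=3):
    """Return the k records whose vectors are most similar to the query."""
    ranked = sorted(
        records,
        key=lambda record: cosine_similarity(query_vector, record["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: each record stands for one embedded transcript chunk.
records = [
    {"id": "a", "vector": [1.0, 0.0, 0.0]},
    {"id": "b", "vector": [0.0, 1.0, 0.0]},
    {"id": "c", "vector": [0.9, 0.1, 0.0]},
]
closest = top_k([1.0, 0.0, 0.0], records, k=2)
```

Here `closest` holds records "a" and "c", in that order. In the real pipeline the question is embedded first, the vector database performs this ranking, and the retrieved chunks are passed to the language model as context.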

        The relevant video titles are cited via hyperlinks directly to the video url.
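
As an illustration of that citation step, the sketch below turns retrieved chunks into a short, de-duplicated list of links. It assumes each match carries `url` and `title` metadata as described above; the exact field names, the markdown link format, and the function name are assumptions rather than the app's actual code.

```python
def format_sources(matches, max_sources=3):
    """Build markdown links for the distinct videos behind the retrieved chunks.

    Several chunks often come from the same video, so URLs are
    de-duplicated while keeping the retrieval order.
    """
    seen_urls = set()
    links = []
    for match in matches:
        url = match["metadata"]["url"]
        if url in seen_urls:
            continue
        seen_urls.add(url)
        links.append(f"[{match['metadata']['title']}]({url})")
        if len(links) == max_sources:
            break
    return links
```

Capping the list at three keeps the answer in line with the "top 2-3 videos" behaviour described in the About section.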

        Frontend UI with chat

        Next.js styled with Tailwind CSS. src/pages/index.tsx contains the base skeleton. src/pages/api/chat-chain.ts is the heart of the code, where the LangChain connections are defined.

        Run app

        npm run dev

        Go to http://localhost:3000. You should be able to type and ask questions now. Done ✅


        🚀Next Steps

        UI/UX: change to your liking.

        Bot personality: edit the prompt template in src/pages/api/chat-chain.ts to tune the bot's tone and gain finer control over its outputs.


        🔧Tools Used

        Next.js, Python, LangChain, OpenAI, AssemblyAI, Pinecone


        👤Contact

        Email Twitter