About
A more natural way to help students study for exams, review weekly content, and recreate similar problems customized to their preference. Trained on the weekly notes for EE16B. It's like talking to your professor. Good for those who suck at taking notes. EE16B students, staff, and more generally anyone can clone this repo and adjust it to their liking.
Official Course Website
UC Berkeley • EE16B: Designing Information Devices and Systems II • Spring 2023
How to Build
Note: these instructions are for macOS; adjust accordingly for Windows / Linux.
Initial setup
Clone the repo and install dependencies.
git clone https://github.com/vdutts7/ee16b-ai-chat
cd ee16b-ai-chat
pnpm install
Create a .env file and add your API keys (use .env.local.example as a template):
OPENAI_API_KEY=""
NEXT_PUBLIC_SUPABASE_URL=""
NEXT_PUBLIC_SUPABASE_ANON_KEY=""
SUPABASE_SERVICE_ROLE_KEY=""
Get your API keys from OpenAI and from your Supabase project settings.
IMPORTANT: Verify that .gitignore contains .env.
Prepare Supabase environment
I used Supabase as my vector store. Alternatives: Pinecone, Qdrant, Weaviate, Chroma, etc.
You should have already created a Supabase project to get your API keys. Inside the project's SQL editor, create a new query and run schema.sql. You should now have a documents table with 4 columns.
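Optionally, you can sanity-check the table from Python. The sketch below is illustrative and assumes the standard LangChain Supabase schema (id, content, metadata, embedding); adjust the column names if your schema.sql differs:

```python
# Optional sanity check that the `documents` table exists.
# Column names assume the standard LangChain Supabase schema (id, content, metadata, embedding);
# adjust if your schema.sql differs.
import os
from supabase import create_client

supabase = create_client(
    os.environ["NEXT_PUBLIC_SUPABASE_URL"],
    os.environ["SUPABASE_SERVICE_ROLE_KEY"],
)

# Selecting a single (possibly empty) row is enough to confirm the table and columns exist.
res = supabase.table("documents").select("id, content, metadata").limit(1).execute()
print(res.data)  # [] on a freshly created table
```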
Embed and upsert
Inside the config folder is the transcripts folder, with all lectures as .txt files and the corresponding JSON files for the metadata. The .txt files were scraped from the lecture recordings separately ahead of time (OpenAI's Whisper is a great package for speech-to-text transcription). Change according to your preferences. pageContent and metadata are by default stored in Supabase, along with an int8 type for the 'id' column.
Manually run the embed-script.ipynb notebook in the scripts folder, or run the package script from the terminal:
npm run embed
This is a one-time process; depending on the size of the data you wish to upsert, it can take a few minutes. Check the Supabase database to see the updates reflected in the rows of your table.
Technical explanation
This code performs the following (a sketch of the full flow appears after this list):

1. Installs the supabase Python library using pip. This allows interaction with a Supabase database.
2. Loads various libraries:
   - supabase - for interacting with Supabase
   - langchain - for text processing and vectorization
   - json - for loading the JSON metadata files
3. Loads the Supabase URL and API key from .env. These are used to create a supabase_client that connects to the Supabase database.
4. Loads text data from the .txt lecture transcripts and the JSON metadata files.
5. Uses a RecursiveCharacterTextSplitter to split the lecture text into chunks, breaking the text into manageable pieces for processing. Chunk size and chunk overlap can be changed according to preference and essentially control the amount of specificity: a larger chunk size and smaller overlap result in fewer, broader chunks, while a smaller chunk size and larger overlap produce more, narrower chunks.
6. Creates OpenAI text-embedding-ada-002 embeddings. These are vectors of 1536 dimensions optimized for cosine-similarity search. The vectors are combined with the metadata from the JSON files, along with other lecture-specific info, and upserted to the database as vector embeddings in tabular row format, i.e. a SupabaseVectorStore.
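For reference, here is a minimal sketch of that flow, assuming a standard LangChain + Supabase setup. The file paths, the metadata file naming convention, and the chunk_size / chunk_overlap values are illustrative, not the repo's exact values:

```python
# Minimal sketch of the embed/upsert pipeline, assuming a standard LangChain + Supabase setup.
# File paths, the metadata naming convention, and chunk_size/chunk_overlap are illustrative.
import glob
import json
import os

from dotenv import load_dotenv                      # pip install python-dotenv
from supabase import create_client                  # pip install supabase
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings   # reads OPENAI_API_KEY from the environment
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import SupabaseVectorStore

load_dotenv()  # pulls the keys from .env
supabase_client = create_client(
    os.environ["NEXT_PUBLIC_SUPABASE_URL"],
    os.environ["SUPABASE_SERVICE_ROLE_KEY"],
)

# Load each lecture transcript plus its JSON metadata file
# (assumes lectureN.txt has a matching lectureN.json next to it).
docs = []
for txt_path in sorted(glob.glob("config/transcripts/*.txt")):
    with open(txt_path) as f:
        text = f.read()
    with open(txt_path.replace(".txt", ".json")) as f:
        metadata = json.load(f)
    docs.append(Document(page_content=text, metadata=metadata))

# Split into overlapping chunks. Larger chunk_size + smaller chunk_overlap -> fewer, broader
# chunks; smaller chunk_size + larger chunk_overlap -> more, narrower chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed (OpenAIEmbeddings defaults to text-embedding-ada-002, 1536 dimensions)
# and upsert everything into the `documents` table as a SupabaseVectorStore.
embeddings = OpenAIEmbeddings()
SupabaseVectorStore.from_documents(
    chunks,
    embeddings,
    client=supabase_client,
    table_name="documents",
)
```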
Run app
Run the app and verify everything went smoothly:
npm run dev
Go to http://localhost:3000. You should be able to type and ask questions now. Done!
Next steps
Customizations
UI/UX: change to your liking.
Bot behavior: edit the prompt template in /utils/makechain.ts to fine-tune and gain greater control over the bot's outputs.
Data: modify the .txt files in /config/transcripts and the main script in /scripts/embed-script.ipynb.