Wednesday, January 10, 2024

Duck-GPT -- Adding DuckDB Documents to an LLM

tl;dr: An LLM that is up to date on DuckDB!

tl;dr2: My Hello World project for integrating new data into an existing language model.

ChatGPT and friends are great at writing SQL.  The only problem with respect to DuckDB is that GPT's knowledge lags the real world, so there is a bunch of innovation in the DuckDB space that GPT doesn't know about.

So, you often ask a question and specify IN DUCKDB, and GPT mildly hallucinates and gives you a MySQL answer, a SQL Server answer, an answer using a Python function, etc.  This is no problem for a lot of generic-style SQL queries, but it's frustrating that it doesn't know about the latest DuckDB features.

So, I'm going to try building my own local LLM that incorporates the (most excellent) DuckDB documentation.  So far I've experimented with pasting doc pages into ChatGPT, and with including them as part of the system prompt in a local llama2 instance.
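As a rough sketch of the "doc pages in the system prompt" experiment: the snippet below just glues pasted pages into one prompt string and stops adding pages once a character budget is used up. The function name `build_system_prompt`, the `max_chars` budget, and the placeholder page text are all made up for illustration; how the prompt is then handed to llama2 depends on whatever tool is running the model, so that part is left out.

```python
def build_system_prompt(doc_pages, max_chars=8000):
    """Concatenate doc pages into a system prompt, adding pages until the budget is used up."""
    header = "You are a SQL assistant. Answer using DuckDB syntax, based on these docs:\n"
    parts = [header]
    used = len(header)
    for title, text in doc_pages.items():
        section = f"\n## {title}\n{text.strip()}\n"
        if used + len(section) > max_chars:
            break  # the next page would not fit, so stop here
        parts.append(section)
        used += len(section)
    return "".join(parts)


# Placeholder contents (assumption): in practice this would be text pasted from the docs.
pages = {
    "Autocomplete extension": "(pasted documentation text goes here)",
    "Full-text search extension": "(pasted documentation text goes here)",
}
prompt = build_system_prompt(pages)
```

The obvious limitation is the budget: only a handful of pages fit in a prompt, which is why this approach works for one-off questions but not for the whole documentation set.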

The initial results were promising... I was asking about using the DuckDB autocomplete package, and both gave some really insightful answers... even noting that I could do a lateral join of possible completion strings with a regexp to pull out the word being completed, and use that to narrow down the completion choice for each keystroke.  Neat!

So, I'm going to kick off my learning project here, with the hope that it will be useful to others, and perhaps interested people might feed back some good tips -- I'm definitely no expert at this!

Roughly speaking, I would like to

  • start with some popular LLM that is reasonably good at SQL
  • add in whatever knowledge I can from the DuckDB documentation
  • wrap that up into a "Chat-DuckDB" model that I can upload for other people to use

Additionally, I thought two other things might be interesting:
  • since the DuckDB folks keep their old versions of the documentation online, let's add all the old versions of the documents, and try to make the DuckDB chatbot aware of changes over time... "When did DuckDB add feature X", etc.
  • In a previous post, I collected my notes for creating a DuckDB full text index for Shakespeare.  It would be neat if that could be used to answer questions about Shakespeare, and pull example texts out of DuckDB (so you get "real" corpus text, not possible hallucinations, paraphrases, etc).
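To make the "When did DuckDB add feature X" idea a bit more concrete, here is a minimal sketch that assumes the old documentation has already been collected into a dict keyed by version string. It simply reports the earliest version whose text mentions a term. The helper name `first_version_mentioning`, the version numbers, and the doc text are all invented for illustration and are not real DuckDB release history.

```python
def first_version_mentioning(docs_by_version, term):
    """Return the earliest version whose docs mention `term`, or None if none do."""
    def version_key(version):
        # Compare versions numerically so that "0.10.0" sorts after "0.3.0".
        return tuple(int(part) for part in version.split("."))

    for version in sorted(docs_by_version, key=version_key):
        if term.lower() in docs_by_version[version].lower():
            return version
    return None


# Made-up versions and text (assumption), purely to exercise the function.
docs = {
    "0.2.0": "Docs text that covers joins and aggregates.",
    "0.10.0": "Docs text that covers joins, aggregates, and feature_x.",
    "0.3.0": "Docs text that covers joins and aggregates.",
}
print(first_version_mentioning(docs, "feature_x"))
```

A first mention in the docs is only a proxy for when a feature actually shipped, but it might be a reasonable starting point for giving the chatbot some sense of history.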

On a practical note, I had a chance to chat briefly with llamabot author Eric Ma about this, and he mentioned that the upcoming release will have some RAG-based (retrieval-augmented generation) features which seem to be just what I was looking for.  I have some previous experience with llamabot, and thought it was very nicely put together, so I'll be going down that route.
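For anyone new to the RAG idea, here is a toy sketch of the general shape: pick the documentation chunks that look most relevant to the question, and put only those in the prompt. This is not llamabot's API. The `retrieve` and `build_prompt` helpers are hypothetical, and the word-overlap scoring stands in for the embedding-based similarity search a real RAG setup would typically use.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Rank chunks by how many words they share with the question; return the top k."""
    question_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble a prompt containing only the retrieved chunks plus the question."""
    context = "\n---\n".join(retrieve(question, chunks, k))
    return f"Use only this documentation:\n{context}\n\nQuestion: {question}"


# Placeholder chunks (assumption): real ones would be slices of the DuckDB docs.
chunks = [
    "placeholder chunk about the autocomplete extension and completion suggestions",
    "placeholder chunk about full text search indexes",
    "placeholder chunk about reading parquet files",
]
prompt = build_prompt("How do I use autocomplete suggestions?", chunks, k=1)
```

The point of retrieving first is that the prompt stays small no matter how much documentation gets indexed.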

So for the next couple of posts, stand by for Duck-GPT... Quack Quack!
