EastBay DuckDB

Wednesday, January 10, 2024

Duck-GPT -- Adding DuckDB Documents to an LLM

tl;dr: An LLM that is up to date on DuckDB!

tl;dr2: My Hello World project for integrating new data into an existing language model.

Chat GPT and friends are great at writing SQL. The only problem wrt DuckDB is that GPT's knowledge lags the real world, so there is a bunch of innovation in the DuckDB space that GPT doesn't know about.

So, you often as a question and specify IN DUCKDB, and GPT mildly hallucinates and it gives you a MySQL answer, a SQL Server answer, an answer using a Python function etc. This is no problem for a lot of generic-style SQL queries, but it's frustrating that it doesn't know about the latest features.

So, I'm going to try building my own local LLM that incorporates the (most excellent) DuckDB documentation. I've experimented with pasting in doc pages to Chat GPT, and including them as part of the system prompt in a local llama2 instance.

The initial results were promising... I was asking about using the DuckDB autocomplete package, and both gave some really insightful answers... even noting that I could do a lateral join of possible completion strings with a regexp to pull out the word being completed, and use that to narrow down the completion choice for each keystroke. Neat!

So, I'm going to kick off my learning project here, with the hope that it will be useful to others, and perhaps interested people might feed back some good tips -- I'm definitely no expert at this!

Roughly speaking, I would like to

start with some popular LLM that is reasonably good at SQL
add in whatever knowledge I can from the DuckDB documentation
wrap that up into a "Chat-DuckDB" model that I can upload for other people to use

Additionally, I thought two other things might be interesting:

since the DuckDB folks keep their old versions of the documentation online, let's add all the old versions of the documents, and try to make the DuckDB chatbot aware of changes over time... "When did DuckDB add feature X", etc.
In a previous post, I collected my notes for creating a DuckDB full text index for Shakespeare. It would be neat if that could be used to answer questions about Shakespeare, and pull example texts out of DuckDB (so you get "real" corpus text, not possibly hallucinations, paraphrases, etc).

On a practical note, I had a chance to chat briefly with llamabot author Eric Ma about this, and he mentioned that the upcoming release will have some RAG based features which seem to be just what I was looking for. I have some previous experience with llamabot, and thought is was very nicely put together, so I'll be going down that route.

So the next couple of posts, stand by for Duck-GPT... Quack Quack!

Sunday, August 20, 2023

Setting up DuckDB/Rust

Here's what I did to get DuckDB/rust going on my system.

Prerequisites

We need two things from DuckDB: header files (so we can build duckdb-rs) and the library (so duckdb-rs can call the DuckDB functions). We will get these by building DuckDB. At the time of writing , version 0.8.1 is the DuckDB version supported by duckdb-rs.

We will clone the DuckDB repo, check out version 0.8.1, and build it.

mkdir $HOME/duck81

cd $HOME/duck81

git clone https://github.com/duckdb/duckdb.git

cd duckdb

git checkout tags/v0.8.1

make

Next, we need a couple of environment variables. These are used when crate builds the duckdb-rs crate.

export DUCKDB_LIB_DIR=$HOME/duck81/duckdb/build/release/src

export DUCKDB_INCLUDE_DIR=$HOME/duck81/duckdb/src/include

Finally, we need to tell the OS where the dynamic libraries are located.

# macos

export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:$DUCKDB_LIB_DIR

# linux

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DUCKDB_LIB_DIR

This setup is fine for development but not for production use. If you are shipping rust in production, you know how to package shared libraries, etc. There are some packaging notes on the duckdb-rs github page.

Cargo.toml

Add this line to your dependencies section.

[dependencies]

duckdb = "0.8.1"

Sample Code

Sample Code

Here's a lightly modified version of the example code on the duckdb-rs github page.

use duckdb::{params, Connection, Result};

use duckdb::arrow::record_batch::RecordBatch;

use duckdb::arrow::util::pretty::print_batches;

#[derive(Debug)]

struct Person {

id: i32,

name: String,

data: Option<Vec<u8>>,

}

fn main() -> Result<()> {

let conn = Connection::open_in_memory()?;

conn.execute_batch(r"

CREATE SEQUENCE seq;

CREATE TABLE person (

id INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq'),

name TEXT NOT NULL,

data BLOB

);

")?;

conn.execute("INSERT INTO person (name, data) VALUES (?, ?)",

params!["able", "aaa"],)?;

conn.execute("INSERT INTO person (name, data) VALUES (?, ?)",

params!["baker", "bbb"],)?;

conn.execute("INSERT INTO person (name, data) VALUES (?, ?)",

params!["charlie", "ccc"],)?;

conn.execute("INSERT INTO person (name, data) VALUES (?, ?)",

params!["dog", "ddd"],)?;

println!("query table by rows");

let mut stmt = conn.prepare("SELECT id, name, data FROM person")?;

let person_iter = stmt.query_map([], |row| {

Ok(Person {

id: row.get(0)?,

name: row.get(1)?,

data: row.get(2)?,

})

})?;

for person in person_iter {

println!("Found person {:?}", person.unwrap());

}

println!("query table by arrow");

let rbs: Vec<RecordBatch> = stmt.query_arrow([])?.collect();

print_batches(&rbs).unwrap();

Ok(())

}

Execution
We can ignore the (incorrect) warning about the People fields never being read.

% cargo run

Compiling duck1 v0.1.0 (/Users/neologic/biblio/anachron/duck1)

warning: fields `id`, `name`, and `data` are never read

--> src/main.rs:7:5

6 | struct Person {

| ------ fields in this struct

7 | id: i32,

| ^^

8 | name: String,

| ^^^^

9 | data: Option<Vec<u8>>,

| ^^^^

= note: `Person` has a derived impl for the trait `Debug`, but this is intentionally ignored during dead code analysis

= note: `#[warn(dead_code)]` on by default

warning: `duck1` (bin "duck1") generated 1 warning

Finished dev [unoptimized + debuginfo] target(s) in 1.24s

Running `target/debug/duck1`

query table by rows

Found person Person { id: 1, name: "able", data: Some([97, 97, 97]) }

Found person Person { id: 2, name: "baker", data: Some([98, 98, 98]) }

Found person Person { id: 3, name: "charlie", data: Some([99, 99, 99]) }

Found person Person { id: 4, name: "dog", data: Some([100, 100, 100]) }

query table by arrow

+----+---------+--------+

| id | name | data |

+----+---------+--------+

| 1 | able | 616161 |

| 2 | baker | 626262 |

| 3 | charlie | 636363 |

| 4 | dog | 646464 |

+----+---------+--------+

Wednesday, May 24, 2023

Query Diffing

Given tables or query results A and B, we want to identify:

inserted rows: The key appears in B, but not in A.
deleted rows: The key appears in A, but not in B.
modified: The key appears in both tables, but with different values.
unmodified: the key appears in both tables, and the values are the same.

For example, given the data below, identify:

a is modified, 1 --> 9
b is deleted
c is unmodified
d is inserted

CREATE TABLE A AS

SELECT 'a' AS key, 1 AS val

UNION SELECT 'b', 2

UNION SELECT 'c', 3;

CREATE TABLE b AS

SELECT 'a' AS key, 9 AS val

UNION SELECT 'c', 3

UNION SELECT 'd', 4;

SELECT * FROM a ORDER BY KEY;

a,1

b,2

c,3

SELECT * FROM b ORDER BY KEY;

a,9

c,3

d,4

These queries will perform the specified operations.

Modified Rows

select b.key, b.val from a left join b

on a.key = b.key where a.val <> b.val;

a,9

Deleted Rows

select key, val from a where key not in (select key from b);

b,2

Unmodified Rows

select key, val from a intersect select key, val from b;

c,3

Inserted Rows

select key, val from b where key not in (select key from a);

d,4

Note that if you have multiple columns in your key, you can specify them as (`key1, key2, ...), etc.

For example, if the tables had two key columns the Inserted query would be

select key1, key2, val from b

where (key1, key2) not in (select key1, key2 from a);

Friday, May 19, 2023

DuckDB Date Sampler

WITH generate_date AS (

SELECT CAST(RANGE AS DATE) AS date_key

FROM RANGE(DATE '2009-01-01', DATE '2013-12-31', INTERVAL 1 DAY)

)

SELECT date_key AS date_key,

DAYOFYEAR(date_key) AS day_of_year,

YEARWEEK(date_key) AS week_key,

WEEKOFYEAR(date_key) AS week_of_year,

DAYOFWEEK(date_key) AS day_of_week,

ISODOW(date_key) AS iso_day_of_week,

DAYNAME(date_key) AS day_name,

DATE_TRUNC('week', date_key) AS first_day_of_week,

DATE_TRUNC('week', date_key) + 6 AS last_day_of_week,

YEAR(date_key) || RIGHT('0' || MONTH(date_key), 2) AS month_key,

MONTH(date_key) AS month_of_year,

DAYOFMONTH(date_key) AS day_of_month,

LEFT(MONTHNAME(date_key), 3) AS month_name_short,

MONTHNAME(date_key) AS month_name,

DATE_TRUNC('month', date_key) AS first_day_of_month,

LAST_DAY(date_key) AS last_day_of_month,

CAST(YEAR(date_key) || QUARTER(date_key) AS INT) AS quarter_key,

QUARTER(date_key) AS quarter_of_year,

CAST(date_key - DATE_TRUNC('Quarter', date_key) + 1 AS INT) AS day_of_quarter,

('Q' || QUARTER(date_key)) AS quarter_desc_short,

('Quarter ' || QUARTER(date_key)) AS quarter_desc,

DATE_TRUNC('quarter', date_key) AS first_day_of_quarter,

LAST_DAY(DATE_TRUNC('quarter', date_key) + INTERVAL 2 MONTH) as last_day_of_quarter,

CAST(YEAR(date_key) AS INT) AS year_key,

DATE_TRUNC('Year', date_key) AS first_day_of_year,

DATE_TRUNC('Year', date_key) - 1 + INTERVAL 1 YEAR AS last_day_of_year,

ROW_NUMBER() OVER

(PARTITION BY YEAR(date_key), MONTH(date_key), DAYOFWEEK(date_key) ORDER BY date_key)

AS ordinal_weekday_of_month

FROM generate_date;

(credit)

EastBay DuckDB

Wednesday, January 10, 2024

Duck-GPT -- Adding DuckDB Documents to an LLM

Sunday, August 20, 2023

Setting up DuckDB/Rust

Wednesday, May 24, 2023

Query Diffing

Friday, May 19, 2023

DuckDB Date Sampler

Duck-GPT -- Adding DuckDB Documents to an LLM

Report Abuse

Labels