Wednesday, April 26, 2023

Generating a Shakespeare corpus for full-text searching from JSON

 (this is a placeholder, I will format the following more beautifully later)

Files are here: https://github.com/marhar/duckdb_tools/tree/main/full-text-shakespeare

mh duckdb_tools/full-text-shakespeare% cat generate-shakespeare-csv.sql
.echo on
-- We want to create a simple "flat" table from the nested structure here:
--    'https://raw.githubusercontent.com/grokify/kibana-tutorial-go/master/shakespeare.json',
--
-- In addition, we need to generate a unique identifier for each line, since that is
-- required for full text search to work properly.

create table corpus as select 'x' as line_id, play_name, line_number, speaker, text_entry
  from read_json(
         'https://raw.githubusercontent.com/grokify/kibana-tutorial-go/master/shakespeare.json',
         auto_detect=true)
   where play_name is not null and line_number is not null and speaker is not null and text_entry is not null;


-- Get rid of stage directions, Act and Scene number headers, etc.

delete from corpus where line_number = '';


-- ChatGPT claims these abbreviations are commonly used by Shakespeare scholars,
-- and helpfully provided this information in the form of an update.
-- Hope ChatGPT isn't hallucinating about this!

UPDATE corpus
SET line_id =
    CASE
        WHEN play_name = 'Alls well that ends well' THEN 'AWW/' || line_number
        WHEN play_name = 'Antony and Cleopatra' THEN 'A&C/' || line_number
        WHEN play_name = 'As you like it' THEN 'AYLI/' || line_number
        WHEN play_name = 'A Comedy of Errors' THEN 'CE/' || line_number
        WHEN play_name = 'Coriolanus' THEN 'COR/' || line_number
        WHEN play_name = 'Cymbeline' THEN 'CYM/' || line_number
        WHEN play_name = 'Hamlet' THEN 'HAM/' || line_number
        WHEN play_name = 'Henry IV' THEN 'H4/' || line_number
        WHEN play_name = 'Henry VI Part 1' THEN '1H6/' || line_number
        WHEN play_name = 'Henry VI Part 2' THEN '2H6/' || line_number
        WHEN play_name = 'Henry VI Part 3' THEN '3H6/' || line_number
        WHEN play_name = 'Henry V' THEN 'H5/' || line_number
        WHEN play_name = 'Henry VIII' THEN 'H8/' || line_number
        WHEN play_name = 'Julius Caesar' THEN 'JC/' || line_number
        WHEN play_name = 'King John' THEN 'KJ/' || line_number
        WHEN play_name = 'King Lear' THEN 'KL/' || line_number
        WHEN play_name = 'Loves Labours Lost' THEN 'LLL/' || line_number
        WHEN play_name = 'Macbeth' THEN 'MAC/' || line_number
        WHEN play_name = 'Measure for measure' THEN 'MM/' || line_number
        WHEN play_name = 'Merchant of Venice' THEN 'MV/' || line_number
        WHEN play_name = 'Merry Wives of Windsor' THEN 'MWW/' || line_number
        WHEN play_name = 'A Midsummer nights dream' THEN 'MSND/' || line_number
        WHEN play_name = 'Much Ado about nothing' THEN 'MAAN/' || line_number
        WHEN play_name = 'Othello' THEN 'OTH/' || line_number
        WHEN play_name = 'Pericles' THEN 'PER/' || line_number
        WHEN play_name = 'Richard II' THEN 'R2/' || line_number
        WHEN play_name = 'Richard III' THEN 'R3/' || line_number
        WHEN play_name = 'Romeo and Juliet' THEN 'R&J/' || line_number
        WHEN play_name = 'Taming of the Shrew' THEN 'TOTS/' || line_number
        WHEN play_name = 'The Tempest' THEN 'TEMP/' || line_number
        ELSE 'unknown/' || line_number
    END;


-- Finally, save the data in whatever format you like.
-- Or just use it in-memory!

copy corpus TO 'shakespeare.csv' (header, delimiter '|');
copy corpus TO 'shakespeare.parquet' (format parquet, compression zstd);
mh duckdb_tools/full-text-shakespeare%
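The long CASE expression is really just a lookup table with a fallback. As a sanity check, here's the same line_id logic sketched in Python with a plain dict (only a subset of the abbreviations above; the line_number value is illustrative):

```python
# Map play names to scholarly abbreviations -- a subset of the SQL CASE above.
ABBREV = {
    'Hamlet': 'HAM',
    'Macbeth': 'MAC',
    'Romeo and Juliet': 'R&J',
    'The Tempest': 'TEMP',
}

def line_id(play_name, line_number):
    """Build a unique id like 'HAM/3.1.64', falling back to 'unknown/...'."""
    return ABBREV.get(play_name, 'unknown') + '/' + line_number

print(line_id('Hamlet', '3.1.64'))            # HAM/3.1.64
print(line_id('Two Noble Kinsmen', '1.1.1'))  # unknown/1.1.1
```

The dict-with-default shape is why the SQL version needs the ELSE branch: any play name missing from the mapping still gets a usable (if less pretty) unique id.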

Friday, April 21, 2023

DuckDB ASOF JOIN

Here's a quick example of duckdb's new ASOF JOIN.

Problem: we have a time-based price table; if you have a sale time that falls in the gaps, you get NULL price and total.

Solution: "ASOF JOIN" picks a good value for "in the gap" sales.


Create a price table and sales table.


create table prices as (
    SELECT '2001-01-01 00:16:00'::TIMESTAMP + INTERVAL (v) MINUTE AS ticker_time,
        v as unit_price
    FROM range(0,5) vals(v)
);

create table sales(item text, sale_time timestamp, quantity int);
insert into sales values('a', '2001-01-01 00:18:00', 10);
insert into sales values('b', '2001-01-01 00:18:30', 20);
insert into sales values('c', '2001-01-01 00:19:00', 30);

select * from prices;

┌─────────────────────┬────────────┐
│     ticker_time     │ unit_price │
│      timestamp      │   int64    │
├─────────────────────┼────────────┤
│ 2001-01-01 00:16:00 │          0 │
│ 2001-01-01 00:17:00 │          1 │
│ 2001-01-01 00:18:00 │          2 │ No unit_price for 18:30!
│ 2001-01-01 00:19:00 │          3 │
│ 2001-01-01 00:20:00 │          4 │
└─────────────────────┴────────────┘

select * from sales;
┌─────────┬─────────────────────┬──────────┐
│  item   │      sale_time      │ quantity │
│ varchar │      timestamp      │  int32   │
├─────────┼─────────────────────┼──────────┤
│ a       │ 2001-01-01 00:18:00 │       10 │
│ b       │ 2001-01-01 00:18:30 │       20 │
│ c       │ 2001-01-01 00:19:00 │       30 │
└─────────┴─────────────────────┴──────────┘

Normal Left Join, problem for the 18:30 sale!

-- no price value for 18:30, so item b's unit_price and total are NULL!

select s.*, p.unit_price, s.quantity * p.unit_price as total
 from sales s left join prices p
   on s.sale_time = p.ticker_time;

┌─────────┬─────────────────────┬──────────┬────────────┬───────┐
│  item   │      sale_time      │ quantity │ unit_price │ total │
│ varchar │      timestamp      │  int32   │   int64    │ int64 │
├─────────┼─────────────────────┼──────────┼────────────┼───────┤
│ a       │ 2001-01-01 00:18:00 │       10 │          2 │    20 │
│ c       │ 2001-01-01 00:19:00 │       30 │          3 │    90 │
│ b       │ 2001-01-01 00:18:30 │       20 │       NULL │  NULL │
└─────────┴─────────────────────┴──────────┴────────────┴───────┘

ASOF Join, picks a good price for the 18:30 sale!

-- using asof, 18:30 "rounds down" to use the 18:00 unit_price

select s.*, p.unit_price, s.quantity * p.unit_price as total_cost
  from sales s asof left join prices p
    on s.sale_time >= p.ticker_time;

┌─────────┬─────────────────────┬──────────┬────────────┬────────────┐
│  item   │      sale_time      │ quantity │ unit_price │ total_cost │
│ varchar │      timestamp      │  int32   │   int64    │   int64    │
├─────────┼─────────────────────┼──────────┼────────────┼────────────┤
│ a       │ 2001-01-01 00:18:00 │       10 │          2 │         20 │
│ b       │ 2001-01-01 00:18:30 │       20 │          2 │         40 │
│ c       │ 2001-01-01 00:19:00 │       30 │          3 │         90 │
└─────────┴─────────────────────┴──────────┴────────────┴────────────┘
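The inequality `s.sale_time >= p.ticker_time` means "take the most recent price at or before the sale time". On a sorted price list that lookup is just a bisect search; here's a minimal Python sketch of the same semantics, using the same data as above with times reduced to minutes for brevity:

```python
import bisect

# Same data as the prices table above, times expressed as minutes.
ticker_times = [16, 17, 18, 19, 20]
unit_prices  = [0, 1, 2, 3, 4]

def asof_price(sale_minute):
    """Most recent unit_price at or before sale_minute (ASOF 'round down')."""
    i = bisect.bisect_right(ticker_times, sale_minute) - 1
    return unit_prices[i] if i >= 0 else None

print(asof_price(18.5))  # 2 -- the 18:00 price, just like item b above
```

A sale earlier than every price (here, before minute 16) has no "at or before" match, which is why the SQL version is an ASOF *left* join: those rows survive with NULL price.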


DuckDB Full Text Search

I learned a lot from Laurens Kuiper's blog post "Testing out DuckDB's Full Text Search Extension", but unfortunately I couldn't find the data set he was using for testing.

So, here's a simple "hello world" example using the easily accessible (and parsable!) kjv.txt. Note that everything past the table creation can be done in the duckdb cli.

% cat duckdb-full-text-search.py      

import re
import duckdb

# -------- prepare the data -----------
fd = open('kjv.txt') # https://www.o-bible.com/download/kjv.txt
fd.readline() # skip first line
data = []
for line in fd.readlines():
    line = line.rstrip()
    # book, chap, verse, body = re.match(r'(\d?[A-Za-z]+)(\d+):(\d+)\s+(.*)', line).groups()
    ref, body = re.match(r'(\d?[A-Za-z]+\d+:\d+)\s+(.*)', line).groups()
    data.append((ref,body,))

# -------- create the table -----------
db = duckdb.connect()
db.cursor().execute("create table corpus(ref text, body text)")
db.cursor().executemany("insert into corpus(ref, body) values($1, $2)", data)

# -------- everything below could be run in the duckdb cli ---------

# -------- create the index -----------
db.cursor().execute(
    """
    install 'fts';
    load fts;
    pragma create_fts_index('corpus', 'ref', 'ref', 'body');
    """)

# -------- full text query -----------
print(db.sql("""
    select fts_main_corpus.match_bm25(ref, 'whale') as score,
      ref, body as "search for 'whale'"
    from corpus
    where score is not null
    order by score;
    """))

% python3 ./duckdb-full-text-search.py
┌───────────────────┬──────────┬───────────────────────────────────────────────┐
│       score       │   ref    │              search for 'whale'               │
│      double       │ varchar  │                    varchar                    │
├───────────────────┼──────────┼───────────────────────────────────────────────┤
│   2.7248255618541 │ Eze32:2  │ Son of man, take up a lamentation for Phara…  │
│ 3.839526928067141 │ Ge1:21   │ And God created great whales, and every liv…  │
│ 3.839526928067141 │ Mat12:40 │ For as Jonas was three days and three night…  │
│ 6.497660955190547 │ Job7:12  │ Am I a sea, or a whale, that thou settest a…  │
└───────────────────┴──────────┴───────────────────────────────────────────────┘
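For reference, here's what that `re.match` pattern extracts from a kjv.txt line. The sample line is abbreviated from the "Ge1:21" verse in the output above; the kjv.txt format is `Book Chapter:Verse  text`:

```python
import re

# A kjv.txt-style line: reference, whitespace, verse body.
line = 'Ge1:21 And God created great whales, ...'

# \d? allows numbered books like '1Sa'; the rest captures 'Chapter:Verse'.
ref, body = re.match(r'(\d?[A-Za-z]+\d+:\d+)\s+(.*)', line).groups()
print(ref)   # Ge1:21
print(body)  # And God created great whales, ...
```

The commented-out regex in the script splits the reference into book/chapter/verse separately; the one actually used keeps the whole reference as a single key, which is exactly what the FTS index needs for its document identifier column.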

Simple DuckDB Interface

I'm starting to collect some useful DuckDB stuff here: 

    https://github.com/marhar/duckdb_tools

The first tool is duckwatch, which allows you to edit a SQL file and see the output continuously updated in another window.  It's a poor programmer's IDE!

You can run it a couple of ways:

$ duckwatch [-e] my-file.sql         # runs your commands in a transient db.
                                     # every time you save your sql file, the
                                     # file is re-executed from scratch.
$ duckwatch [-e] my-file.sql my.db   # same, but runs it using this db file.
                                     # changes are persistent.
$

Running with "-e" pops open a new window and puts you in an editor.  If you don't want that, just edit the sql file in your favorite editor, and whenever you save the file the SQL code will be re-executed.



Setting up DuckDB/Rust

Here's what I did to get DuckDB/rust going on my system. Prerequisites We need two things from DuckDB: header files (so we can build duc...