paint-brush
What The Heck Is DuckDB?by@progrockrec
9,042 reads
9,042 reads

What The Heck Is DuckDB?

by Shawn GordonJuly 26th, 2022
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

DuckDB is a column-based data store that uses vectorized processing to optimize OLAP workloads. DuckDB Labs is a commercial company formed by the creators of DuckDB in July 2021 to provide support, custom extensions, and even custom versions of the product as a way to monetize it. It is an MIT-licensed open source project whose code is hosted on GitHub and is hosted in the wild. The product is described as a mutant offspring of SQLite and Redshift. The DuckDB code is available in a web-based shell called “duckdb.org”

Companies Mentioned

Mention Thumbnail
Mention Thumbnail
featured image - What The Heck Is DuckDB?
Shawn Gordon HackerNoon profile picture

Introduction

I like to spend time on Linkedin reading through posts from companies about news, new releases, funding, new products, etc.. During a recent perusal, I saw something about “DuckDB”, and as much as I pay attention to the data space, I hadn’t heard of this one, so I thought I’d investigate and answer the question “What the heck is DuckDB?”.

Overview

DuckDB itself is an MIT-licensed open source project whose code is hosted on GitHub. DuckDB Labs is a commercial company formed by the creators of DuckDB in July 2021 to provide support, custom extensions, and even custom versions of the product as a way to monetize it. This model reminds me of the early days of open source monetization and is one I like.


DuckDB is briefly described as SQLite for analytic workloads. While SQLite is an embeddable, row-based, and b-tree indexed data store well suited for OLTP workloads, DuckDB is an embeddable column-based data store that uses vectorized processing to optimize OLAP workloads, you could also think of it as an embedded Amazon Redshift or a mutant offspring of SQLite and Redshift. Some of the features are:


  • Simple installation
  • Single-File storage format
  • No server
  • Fast processing
  • Language library integrations
  • Not reliant on any external config files or settings
  • Programmatic SQL API
  • Fully ACID compliant
  • WASM (web assembly) version available


OLTP

OLAP

For your operation workloads

For your analytic workloads

Shorter queries

Longer queries for complex questions

Tables are more highly normalized

Tables are de-normalized

Typically implemented as row-oriented data stores

Typically implemented as column-oriented data stores

Testing

I decided to use their very clever WASM web-based shell to try out querying some Parquet files, of which I grabbed some to play with from here. I started with the “.files add” command to load up the parquet file:


Then I did some basic SQL to check it out:

duckdb> select count(*) from userdata1.parquet;
┌──────────────┐
│ count_star() │
╞══════════════╡
│         1000 │
└──────────────┘
Elapsed: 1 ms

duckdb> 

duckdb> select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌────────────┬───────────┬────────────────────────────┐
│ first_name ┆ last_name ┆ email                      │
╞════════════╪═══════════╪════════════════════════════╡
│ Emily      ┆ Stewart   ┆ estewart9@opensource.org   │
│ Annie      ┆ Torres    ┆ atorrest@ning.com          │
│ William    ┆ Green     ┆ wgreen63@phpbb.com         │
│ Jack       ┆ Medina    ┆ jmedina7y@fda.gov          │
│ Jeremy     ┆ Bennett   ┆ jbennettck@wikipedia.org   │
│ Carlos     ┆ Day       ┆ cdaycn@gravatar.com        │
│ Ryan       ┆ Mills     ┆ rmillsgj@angelfire.com     │
│ Betty      ┆ Gibson    ┆ bgibsonka@tamu.edu         │
│ Wanda      ┆ Stanley   ┆ wstanleymz@sourceforge.net │
│ Evelyn     ┆ Spencer   ┆ espencerpi@ted.com         │
│ George     ┆ Howard    ┆ ghowardqh@mapquest.com     │
└────────────┴───────────┴────────────────────────────┘
Elapsed: 2 ms


You can even do an ‘explain’:

duckdb> explain select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         first_name        │
│         last_name         │
│           email           │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          country          │
│         first_name        │
│         last_name         │
│           email           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  Filters: country=Nigeria │
│   AND country IS NOT NULL │
└───────────────────────────┘     


A very nice feature with DuckDB is if you are working with Python for example, you can just add it as a library by adding “import duckdb” to your python script, and then it is in your python process, so it then feels very integrated into your program, unlike working with something like MySQL or Postgres. So, using our userdata1.parquet file, we could do something like this:


import duckdb
myconnector = duckdb.connect('myduckdb.duckdb')
cursor = myconnector.cursor()
cursor.execute("""
CREATE TABLE userdata(
   registration_dttm  date,
   Id                 int,
   first_name         varchar, 
   Last_name          varchar,
   email              varchar,
   gender             varchar,
   ip_address         varchar,
   cc                 varchar,
   country            varchar,
   birthdate          varchar,
   salary             float,
   title              varchar,
   comments 
)
"""
)
 
cursor.execute("COPY userdata FROM 'userdata1.parquet' (HEADER)")
 
print(cursor.execute('select count(*) from userdata).fetchall())
cursor.close()
conn.close()

In the code snippet shown above, we connect to the ‘myduckdb.duckdb’ database,  create a table that matches our parquet file, copy the data into it and then perform a simple count query.

Summary

This is a really cool project. While I’ve been aware of the advantages of columnar data stores for about eight years because of Sisense initially, I only started working with them more extensively in the past year. I’m especially excited by their WASM implementation and the clever things they did with Arrow as a data protocol for the data import and query results. Not only is it a great technical example of WASM, but it has some great utility as well. Then, as I was wrapping this up, I ran across a , which was the first DuckDB user group meeting from April 2022 that you might want to check out. Lots of great things on the roadmap. As a final note, I have nothing to do with this project or the people behind it, they don’t even know I wrote this, so if I seem like a bit of a fanboy, it’s because I’m legitimately very keen on the project.