Part 1 in our SQL parsing journey with TypeScript
Building a TypeScript SQL Parser
It all begins with an ambitious idea
Hello and welcome to this series of articles about my exploration of TypeScript while building a SQL Parser. I had previously used TypeScript at work, but never felt like I spent enough time learning the language beyond how it was used for my daily tasks. When trying to really get to know a new language, I like to pick an area where I already have quite a bit of domain knowledge and see much I can implement. SQL parsing felt like a natural fit with TypeScript given its focus on compile time feedback and a rich type system.
Making Sure It’s A Challenge
There are already a lot of great TypeScript parsers out there already. This is great for checking what is possible or if we get stuck, but I personally find building from scratch to be a better learning experience. There are also a few other things which might be better to explore by not having to take on an existing structure. With that in mind, here are a few requirements that We’ll adhere to unless something is absolutely impossible:
- We will be able to express database schemas through both types and builders that provide concrete implementations.
- We will be able to build and verify a SQL query against a schema to ensure that we have valid, compile time queries.
- We will be able to interpret raw SQL strings as queries that can be validated against a schema.
- We will be able to extend the parsing to provide extensions for SQL engine variants.
- We will be able to identify changes in schemas and provide migrations between versions.
- We will make this a self contained library that does not have any external dependencies.
As of right now, I think all of this is achievable but it’s going to take a lot of work and I’m not entirely sure how to do everything just yet.
Getting Started
I’m going to keep all of the code in linked repository below with branches to keep track of progress between articles. To begin, we will create a simple TypeScript project with just a few utilities to allow us to do testing, linting and code formatting. For this first article, we’re simply going to create the building blocks needed to start work on a parser in the future.
The first task is going to be creating a shared understanding of SQL that all of the components will use. One of the problems with choosing SQL is that there are a lot of concepts that databases mostly agree upon (SELECT, INSERT, UPDATE, DELETE, etc.) However, there are a lot of others where they don’t (MERGE isn’t in every DBMS, Postgres supports a bunch of additional JSON syntax, etc.) Eventually we’ll need to be able to extend and support these scenarios.
Most parsing and translation components I’ve worked with end up representing these complicated concepts as an Abstract Syntax Tree (AST). The trees then are passed around instead of the exact expression allowing different components to focus on what they care about.
Abstract Syntax Trees (AST)
A quick primer on abstract syntax trees will help explain the approach we are taking with this SQL parser. The goal of an AST is to represent the structural components of some problem space (expression, program, etc.) in a more general form that can be processed independently of the original form. The structure is expressed as a tree of typed nodes and branches that often look like the image below which represents the simple script isPanda('🐼');
Each node has a specific type with it’s own structure and can have nested nodes through properties, allowing for the full syntax of the component to be represented. It is abstract in the sense that the individual nodes themselves get their larger meaning from the full tree itself (ex: {type: ‘Identifier', name: ‘isPanda'}
doesn’t give much context about usage by itself). This interpretation model allows the tree to define any problem in the space that the typed nodes themselves can express.
In the parser AST nodes will represent the types of values and operations that can be found in a valid SQL statement. If you want to dive deeper into how an AST works you check out the linked article from Twillo above or some other great introductory articles on Medium like this one:
SQL Basic Types
To start off, we will need some basic types that are supported in SQL and a way to translate those back and forth in TypeScript.
/**
* The set of built-in types for the SQL spec that are supported in this library
*/
export enum SQLBuiltinTypes {
BINARY = "binary",
BIT = "bit",
BIGINT = "bigint",
BLOB = "blob",
CHAR = "char",
CLOB = "clob",
DATE = "date",
DATETIME = "datetime",
DECIMAL = "decimal",
FLOAT = "float",
IMAGE = "image",
INT = "int",
JSON = "json",
NCHAR = "nchar",
NTEXT = "ntext",
NUMERIC = "numeric",
NVARCHAR = "nvarchar",
REAL = "real",
SMALLINT = "smallint",
TEXT = "text",
TIME = "time",
TIMESTAMP = "timestamp",
TINYINT = "tinyint",
VARBINARY = "varbinary",
VARCHAR = "varchar",
XML = "xml",
YEAR = "year",
}
/** The set of types that are binary */
export type BinarySQLTypes =
| SQLBuiltinTypes.BINARY
| SQLBuiltinTypes.BLOB
| SQLBuiltinTypes.CLOB
| SQLBuiltinTypes.IMAGE
| SQLBuiltinTypes.VARBINARY
/** The set of types that are related to date processing */
export type DateSQLTypes =
| SQLBuiltinTypes.DATE
| SQLBuiltinTypes.DATETIME
| SQLBuiltinTypes.TIME
| SQLBuiltinTypes.TIMESTAMP
| SQLBuiltinTypes.YEAR
/** The set of numeric types that are potentially larger than a number can
* support */
export type BigIntSQLTypes = SQLBuiltinTypes.BIGINT | SQLBuiltinTypes.TIMESTAMP
/** The set of types that are numeric */
export type NumericSQLTypes =
| SQLBuiltinTypes.DECIMAL
| SQLBuiltinTypes.FLOAT
| SQLBuiltinTypes.INT
| SQLBuiltinTypes.NUMERIC
| SQLBuiltinTypes.REAL
| SQLBuiltinTypes.SMALLINT
| SQLBuiltinTypes.TINYINT
/** The set of types that are variable length */
export type VariableSQLTypes =
| SQLBuiltinTypes.NVARCHAR
| SQLBuiltinTypes.VARBINARY
| SQLBuiltinTypes.VARCHAR
/** The set of types that are incremental */
export type IncrementalSQLTypes =
| SQLBuiltinTypes.BIGINT
| SQLBuiltinTypes.DECIMAL
| SQLBuiltinTypes.FLOAT
| SQLBuiltinTypes.INT
/** Map the SQLBuiltinType to the valid JS type */
export type TSSQLType<T extends SQLBuiltinTypes> = [T] extends [BigIntSQLTypes]
? number | bigint
: [T] extends [BinarySQLTypes]
? Int8Array
: [T] extends [NumericSQLTypes]
? number
: [T] extends [SQLBuiltinTypes.BIT]
? boolean
: string
There are a couple things going on here so let’s dive into them real quick
SQLBuiltinTypes
has a collection of the most commonly supported SQL types. While there are extensions beyond these, if we can successfully support them we’ve covered a LOT of ground.- There are some exported types to classify the columns by high level groups based on how they are used in the database (Incremental, Variable, etc.) or how they affect primitive types in TypeScript like
BigIntSQLTypes
andBinarySQLTypes
. - We are leveraging a conditional type to map the given column types back into their TypeScript equivalents.
Conditional Types
This is one of the first things that I came across in TypeScript that made SQL parsing seem like a good fit. In essence, you can ask a really simple question to the type system: “If type A has a given relationship to type B then X else Y”. This ternary operator is really powerful and can be used to express all sorts of things, though there are a few caveats we will run into at some point.
- Every time the type compiler has to ask a question, it’s doing work that gets executed EVERY time the file/type/dependency is modified.
- There are depth limits to how far down you can chain operations before the compiler will give up and give an error for “excessively deep or infinite recursion”.
- We can only ask simple relational questions between two objects. All of the questions have to be basic and if you ask them in the wrong order, the compiler can get grumpy and stop, or worse gobble up your CPU.
Defining Our SQL AST
An AST should be able to represent the full set of valid objects and there are a lot of different options in the SQL world we could decide to support. For an example of the scope of just a simple SELECT, the following scenarios are all valid:
- WITH clauses can include fully formed queries that are referenced by name in future segments of a query as if they were a normal table.
- FROM clauses can also reference a subquery as a table instead of using the WITH/Common Table Expression(CTE) syntax.
- WHERE clauses can include filters based on subquery results like: ANY, ALL, EXISTS.
- The results can be GROUPED, filtered by a HAVING clause, ORDERED and LIMITED among other operations.
- Columns and tables can be aliased with different names changing the shape of the data and allowing self referencing.
- Multiple SELECT statements can be chained together via the UNION, INTERSECT or EXCEPT/MINUS clauses.
In addition to all the fun with SELECT, you can INSERT from the results of another query and even use the RETURNING
keyword to return rows from UPDATE/INSERT/DELETE which can then be used in CTE/WITH or subqueries. 😱
Luckily we can represent almost all of these with just a few simple concepts and some generics/union types. As we define these types though it will quickly become apparent that they need to recursively reference themselves. Luckily, TypeScript does support recursive types that we can use to our advantage as long as we are careful about avoiding “infinite” or “excessively deep” scenarios.
Recursive Types
A type is allowed to self reference itself directly, but not as a circular reference. As a simple example, if we were going to describe some information in a Tree structure, we could use this as a valid type:
type TreeNode<T> = {
left?: TreeNode<T>
right?: TreeNode<T>
value?: T
parent?: TreeNode<T> // <-- This is valid
}
This is great because we can express a bunch of information and explore a tree using just the definition with the single generic <T>
that keeps all the nodes of the tree using the same values and shape. However, the following would be invalid:
type Foo<T> = T
type Bar = Foo<Bar> // <-- Bar creates a circular reference to itself
Technically, we could use interfaces for the above to get it working since interfaces have some additional knowledge in the compiler layer but that’s just more restrictions we want to avoid if possible.
In our AST we will need to be able to have a root level query that can optionally have top level subqueries (CTE/WITH), which can themselves have nested named subqueries, which can have subqueries, which can…you get the point. It’s completely understandable that the compiler will at some point complain that we are giving it something potentially infinite to recurse through. It can and also will complain about us having circular references in our code since a SQLQuery
can have a SelectQuery
which can have a nested NamedQuery
which can have a SelectQuery
that breaks our no circular references restrictions.
One way in which we can deal with both of these cases is to eventually kill the type information by returning any
as a placeholder. This tells TypeScript not to worry about checking since literally anything could be used at that location and solves the circular reference problem. However this does have a side effect in causing our es-lint component to be VERY unhappy. As a general practice, the use of any
should be avoided where possible due to that defeating the entire point of the type system. If we try following the advice in some of es-lint in these situations to use unknown
keyword instead, it won’t work because unknown
, unlike any, does NOT satisfy the type constraints we put in place and cannot be inferred
.
Type Inference
One of the really cool things that TypeScript does for us is infer certain types based on the context it is being used in. This is great because we don’t necessarily have to know exactly what concrete types we are working with at any time. We can instead give some hints to the compiler through extends
clauses on our generic types to limit what is considered a valid input. However, this also has some limitations that we have to be aware of because inference is only as good as the current scope for the compiler. We don’t want to dive into all of those edge cases as they are as varied as the code you choose to write, but there are a couple of simple rules that will help us later on.
- The
any
keyword is something we want to avoid as much as possible because it tells the compiler “I don’t care what is here, let it through and stop checking”. If we mess up anywhere along the way, the compiler will happily tell us things are fine when it will in fact break later on. - If you explicitly give hints through return statements,
extends
keywords in our generics or fully qualified types, the compiler will use those tonarrow
the set of valid types. If you rely on inference, the number of valid types from a return can quickly get out of hand. - Because type context has to evaluate all of the possible valid combinations available (not all POSSIBLE combinations, just what is in the execution path) we want to limit possible places where type inference happens. Being explicit in what is being returned should be our default.
AST Types
For our AST, it has to represent a lot of different structures, so we should split our code into some manageable components. Since the tree is simply a sum of the nodes it contains we won’t be losing anything by splitting them out and it keeps the code cleaner. We’ll try to avoid areas where we introduce circular dependencies between these types as much as possible, mostly to reduce work for our compiler when we make changes.
In each of the sections below, we’ll take a look at a portion the AST and talk a little about why the components are shaped or split in a given way. We wont cover all of them but focus on some of the places where we made decisions based on the type system or something we want to leverage later.
Full AST code is in the repo branch below.
Value Types
In any AST we’re going to need some leaf nodes that terminate further exploration. Values are a perfect candidate for this since there isn’t much else you can do with a parameter or numeric value beyond identifying they exist.
/**
* The set of value types supported
*/
export type ValueTypes =
| BooleanValueType
| NumberValueType
| BigIntValueType
| BufferValueType
| StringValueType
| JsonValueType
| ArrayValueType
| NullValueType
| ParameterValueType
/**
* A parameter that is passed into the query at runtime
*/
export type ParameterValueType<Name extends string = string> = {
type: "ParameterValue"
value: Name
}
/**
* A {@link boolean} value
*/
export type BooleanValueType = {
type: "BooleanValue"
value: boolean
}
/**
* A {@link number} value
*/
export type NumberValueType = {
type: "NumberValue"
value: number
}
/**
* A {@link bigint} value
*/
export type BigIntValueType<B extends number | bigint = bigint> = {
type: "BigIntValue"
value: B
}
/**
* A {@link Uint8Array} value
*/
export type BufferValueType<B extends Uint8Array = Uint8Array> = {
type: "BufferValue"
value: B
}
/**
* A {@link string} value
*/
export type StringValueType = {
type: "StringValue"
value: string
}
/**
* An explicit `null` reference
*/
export type NullValueType = {
type: "NullValue"
value: null
}
/**
* A JSON value
*/
export type JsonValueType<J extends object = object> = {
type: "JsonValue"
value: J
}
/**
* An array value
*/
export type ArrayValueType<A extends unknown[] = unknown[]> = {
type: "ArrayValue"
value: A
}
We can see the general shape for our AST is that all nodes will contain a type
field that uniquely identifies the node shape. The second thing is that some of the nodes carry an additional generic type which we want to have available to inspect later. Reasons we might want to do this have more to do with the way that the type system works at compile time vs runtime which is worth chatting through a bit here.
TypeScript is awesome at helping us with types, but we have to remember that everything is transplined into raw JavaScript at the end of the day and all the types are erased. This means that later on, if we want to require that someone passes us an object with the correct type parameters, we can’t just use the ParameterValue.value
field since that’s only visible at runtime. If we want to ensure that we can see the difference between foo
and bar
as parameters, we have to ensure that we carry those values as part of the type like ParameterValue<'foo'>
and ParameterValue<'bar'>
.
This does have some consequences on the number of total types since every parameter generates a new type but it’s a tradeoff we have to make to get more information. We’ll run into this as lot as we get further down the parsing side but that’s a problem for another day. It does help hopefully explain why we have something like BooleanValue
as a simple value (we don’t care about true/false in most cases) but something like BigIntValue<number | bigint>
we do carry forward since typescript treats those differently once we exceed our Number.MAX_SAFE_INTEGER
threshold.
A More Complicated Example
We touched upon SELECT statements being a bit more complicated in terms of the logic that they support, so let’s take a look at that.
import type { OneOrMore } from "../type-utils/common.js"
import type { ColumnReference } from "./columns.js"
import type { LogicalExpression } from "./filtering.js"
import type { NamedQuery } from "./named.js"
import type { TableReference } from "./tables.js"
/**
* This is a helper type to instruct TypeScript to stop exploring the recursive
* chains that come from FROM clauses allowing queries which can have selects,
* completing a circular loop.
*/
// eslint-disable-next-line @typescript-eslint/no-explicit-any
type AnyNamedQuery = NamedQuery<any>
/**
* The supported join types
*/
export type JoinType = "LEFT" | "RIGHT" | "INNER" | "OUTER" | "LATERAL" | "FULL"
/**
* Allowed column orderings
*/
export type SortOrder = "ASCENDING" | "DESCENDING"
/**
* Supported aggregation operations
*/
export type ColumnAggretator = "SUM" | "COUNT" | "AVG" | "MAX" | "MIN"
/**
* A join expression
*/
export type JoinExpression<
Type extends string = JoinType,
From extends TableReference | NamedQuery = TableReference | NamedQuery,
On extends LogicalExpression = LogicalExpression
> = {
type: "JoinClause"
joinType: Type
from: From
on: On
}
/**
* Selected columns can be references or aggregates
*/
export type SelectedColumn = ColumnAggregate | ColumnReference
/**
* The set of selected columns that will be returned
*/
export type SelectColumns = {
[key: string]: SelectedColumn
}
/**
* An aggregation on a column (ex: COUNT(id) AS `count`)
*/
export type ColumnAggregate<
Column extends ColumnReference = ColumnReference,
Aggretator extends string = ColumnAggretator,
Alias extends string = string
> = {
type: "ColumnAggregate"
column: Column
aggregator: Aggretator
alias: Alias
}
/**
* A column ordering expression
*/
export type ColumnOrdering<
Column extends ColumnReference = ColumnReference,
Order extends string = SortOrder
> = {
type: "ColumnOrdering"
column: Column
order: Order
}
/**
* Structure for a select clause
*/
export type SelectClause<
Columns extends SelectColumns | "*" = SelectColumns | "*",
From extends TableReference | AnyNamedQuery = TableReference | AnyNamedQuery
> = {
type: "SelectClause"
columns: Columns
from: From
distinct?: true
}
/**
* A join clause
*/
export type JoinClause<
Join extends OneOrMore<JoinExpression> = OneOrMore<JoinExpression>
> = {
join: Join
}
/**
* Structure for a limit clause
*/
export type LimitClause<
Offset extends number = number,
Limit extends number = number
> = {
offset: Offset
limit: Limit
}
/**
* Structure for a group by clause
*/
export type GroupByClause<
GroupBy extends OneOrMore<ColumnReference> = OneOrMore<ColumnReference>
> = {
groupBy: GroupBy
}
/**
* Structure for an order by clause
*/
export type OrderByClause<
OrderBy extends OneOrMore<ColumnOrdering> = OneOrMore<ColumnOrdering>
> = {
orderBy: OrderBy
}
/**
* Structure for a having clause
*/
export type HavingClause<Having extends LogicalExpression = LogicalExpression> =
{
having: Having
}
There are a few things worth noting in this file, starting with the any
es-lint filtering that is at the top. As the comment suggests, this is a requirement to get things working due to the circular nature of a QueryClause
in our AST. A clause can be one of many types and our FROM clauses can have a subquery instead of a table, which can then have it’s own SELECT to fetch data creating a circular reference without the any
termination.
Secondly we can see that several of the additional clauses (Having, Limit, etc.) don’t have a type
field of their own and this is intentional. These are all optional modifications to the base SELECT clause that we could have them as optional parameters on the SelectClause
definition itself. However some things like WhereClauses
are valid on different types and we have to include a LOT more type information for the compiler to verify Select<Columns, From, Where, Having, Limit…>
. To get around this we can simply union the types together and then use some utility functions in the future to pull out the parts if they exist..
Lastly, you’ll notice that we use a lot of string unions for our join types or column aggregation methods. This goes back to the original goal for this library of allow extension while supporting a bunch of the normal use cases out of the box. If we restricted to ONLY use the joins we provided, any new joins would have to be put into that type in our library before someone could use them. This way it’s possible to create an extended AST even though it might be a bit of work for whoever wants to expand upon our solution.
For the rest of the AST code, it all looks pretty similar with defaults for features we will definitely support but are in theory extendable in the future (looking at YOU where clauses…). You’ll also note a few other places where we have to allow the any
keyword to inform the compiler that we don’t want to it explore the infinite recursion we’re setting up as possible. It’s something we might revisit later if it becomes a huge problem.
Next Steps
At this point, we have some simple types defined, an AST that should cover a lot of immediate work for us as well as some goals for what our library should be able to do. We also took a look at a few features in TypeScript that we can leverage (conditional/recursive types, generics, inference) as well as some things we’ll have to watch out for (infinite recursion, type narrowing, any
being used as a “trust us” switch).
Moving on from here, we’ll take a look at how we can define a simple database schema since nothing in our AST will do any sort of validation, it’s just there to capture some query intent regardless of whether or not the operations are possible. Once we have some concept of schema to go along with our AST, we can start working on query building and parsing which can fully leverage the type system and see how far we can push things.
I hope you enjoyed this introduction to the series and a few of the concepts we’ll be trying to build out and I’ll see you soon for part 2!