Unified CQL lexer (tokenizer) with grammar-aware position tracking.
Provides a single shared tokenizer that powers syntax highlighting (colorizer), tab completion (completer), and statement parsing (parser). It replaces three ad-hoc implementations with one consistent interpretation of CQL.
Design: hand-written state machine, O(n) single pass, no dependencies.
See docs/plans/18-cql-lexer.md for motivation and design decisions.
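To make the "hand-written state machine, O(n) single pass" design concrete, here is a minimal sketch of that style of lexer. It is not the module's actual code: `Kind`, `Tok`, and `lex` are hypothetical names, the token set is deliberately tiny, and real CQL needs more states (comments, quoted identifiers, operators, UUIDs).

```rust
/// Hypothetical token kinds; the real `TokenKind` enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Whitespace,
    Word,
    Number,
    StringLiteral,
    Other,
}

/// A token as a half-open byte range `start..end` into the input.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Tok {
    kind: Kind,
    start: usize,
    end: usize,
}

/// Single left-to-right pass; the cursor `i` only ever moves forward.
fn lex(input: &str) -> Vec<Tok> {
    let bytes = input.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let b = bytes[i];
        let kind = if b.is_ascii_whitespace() {
            while i < bytes.len() && bytes[i].is_ascii_whitespace() {
                i += 1;
            }
            Kind::Whitespace
        } else if b.is_ascii_alphabetic() || b == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            Kind::Word
        } else if b.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            Kind::Number
        } else if b == b'\'' {
            i += 1; // opening quote
            while i < bytes.len() {
                if bytes[i] == b'\'' {
                    if i + 1 < bytes.len() && bytes[i + 1] == b'\'' {
                        i += 2; // '' is an escaped quote inside the literal
                        continue;
                    }
                    i += 1; // closing quote
                    break;
                }
                i += 1;
            }
            Kind::StringLiteral
        } else {
            // Consume one whole UTF-8 character so offsets stay on char boundaries.
            i += input[i..].chars().next().map_or(1, char::len_utf8);
            Kind::Other
        };
        out.push(Tok { kind, start, end: i });
    }
    out
}
```

Because every token records its byte range, a colorizer, completer, and parser could all slice the original input from the same token list instead of re-scanning it.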
Structs
- Token - A token produced by the CQL lexer.
Enums
- GrammarContext - Grammar context: what syntactic position we're at, used to distinguish keywords from identifiers and to drive tab completion.
- TokenKind - Classification of a CQL token.
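The idea of using grammar context to tell keywords from identifiers can be sketched as follows. Everything here is an assumption for illustration: the variants of `Ctx` and `WordKind`, the tiny keyword list, and the `classify`/`advance` helpers are hypothetical and are not the variants or functions of the real `GrammarContext` and `TokenKind`.

```rust
/// Hypothetical, heavily reduced grammar context.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ctx {
    Start,
    /// The next word must be a table name (for example, right after FROM).
    ExpectTable,
    General,
}

/// Hypothetical word classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WordKind {
    Keyword,
    Boolean,
    Identifier,
}

/// Illustrative subset of keywords, not the module's real list.
const SAMPLE_KEYWORDS: &[&str] = &["FROM", "KEY", "SELECT", "WHERE"];

/// Classify one word given the context it appears in.
fn classify(word: &str, ctx: Ctx) -> WordKind {
    if ctx == Ctx::ExpectTable {
        // In a strict identifier position, even a keyword-looking word is a name.
        return WordKind::Identifier;
    }
    let upper = word.to_ascii_uppercase();
    if upper == "TRUE" || upper == "FALSE" {
        WordKind::Boolean
    } else if SAMPLE_KEYWORDS.iter().any(|k| *k == upper) {
        WordKind::Keyword
    } else {
        WordKind::Identifier
    }
}

/// Move to the next context after consuming `word`.
fn advance(ctx: Ctx, word: &str) -> Ctx {
    if ctx != Ctx::ExpectTable && word.eq_ignore_ascii_case("FROM") {
        Ctx::ExpectTable
    } else {
        Ctx::General
    }
}

/// Classify a sequence of words, threading the context through.
fn classify_all(words: &[&str]) -> Vec<WordKind> {
    let mut ctx = Ctx::Start;
    words
        .iter()
        .map(|w| {
            let kind = classify(w, ctx);
            ctx = advance(ctx, w);
            kind
        })
        .collect()
}
```

In this sketch, the word `key` is treated as an identifier when it directly follows `FROM`, even though it appears in the keyword list, which is the behavior the context enum exists to support.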
Constants
- COLUMN_LIST_KEYWORDS 🔒 - Keywords that remain keywords inside a SELECT column list. These are clause-level keywords that terminate or modify the column list.
- CQL_KEYWORDS 🔒 - Set of CQL keywords and shell commands (uppercase, sorted for binary search).
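An uppercase, sorted keyword table allows a case-insensitive lookup in O(log n) with the standard library's slice `binary_search`. The sketch below assumes a small illustrative table; `KEYWORDS`, `is_keyword`, and `is_sorted_unique` are hypothetical stand-ins, not the contents of `CQL_KEYWORDS` or the body of `is_cql_keyword`.

```rust
/// Illustrative subset only; uppercase and sorted so `binary_search` is valid.
const KEYWORDS: &[&str] = &["AND", "FROM", "INSERT", "SELECT", "UPDATE", "WHERE"];

/// Case-insensitive membership test: normalize to uppercase, then binary search.
fn is_keyword(word: &str) -> bool {
    let upper = word.to_ascii_uppercase();
    KEYWORDS.binary_search(&upper.as_str()).is_ok()
}

/// Guard for the table's invariant: strictly ascending, so no duplicates either.
fn is_sorted_unique(list: &[&str]) -> bool {
    list.windows(2).all(|pair| pair[0] < pair[1])
}
```

Binary search silently returns wrong answers on an unsorted table, so a check like `is_sorted_unique` is worth running in a unit test whenever keywords are added.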
Functions
- advance_context_after_name 🔒 - Advance context after a name (quoted identifier, etc.).
- advance_context_after_value 🔒 - Advance context after a value (string literal, number, etc.).
- advance_context_after_word 🔒 - Advance the grammar context after seeing a word token.
- char_len_at 🔒 - Get the UTF-8 byte length of the char starting at position i.
- classify_word 🔒 - Classify a word as keyword, boolean, or identifier based on grammar context.
- context_from_tokens - Derive grammar context from a token sequence (skipping whitespace/comments).
- grammar_context_at_end - Get the grammar context at the end of the given input. Useful for tab completion to know what kind of token is expected next.
- has_keyword_before 🔒 - Check if any of the given keywords appear earlier in the significant token list.
- is_column_list_keyword 🔒
- is_cql_keyword - Check if a word is a CQL keyword (case-insensitive).
- is_number_sign_position 🔒 - Determine if a '-' is in a position where it could be a negative number sign (after operator, punctuation, or at start).
- is_operator_char 🔒
- is_punctuation 🔒
- is_strict_identifier_context 🔒 - Contexts where the next word is always an identifier (a name), regardless of whether it matches a keyword. E.g., after FROM the next word is a table name.
- is_two_char_operator 🔒
- looks_like_uuid 🔒 - Check if the text from start to num_end followed by '-' looks like the beginning of a UUID (8 hex digits).
- scan_uuid 🔒 - Scan a UUID pattern: 8-4-4-4-12 hex digits with dashes. Returns the end position if valid, or start if not.
- significant_tokens - Extract only the significant (non-whitespace, non-comment) tokens.
- strip_comments - Strip comments from CQL input, replacing block comments with a space and removing line comments (preserving newlines).
- tokenize - Tokenize a CQL input string into a sequence of tokens with grammar context.
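The contract described for scan_uuid (match 8-4-4-4-12 hex digits with dashes, return the end position if valid, or start if not) can be sketched as below. The signature is an assumption: the real function's parameter types are not shown on this page, so `scan_uuid_sketch` takes a byte slice and a start offset purely for illustration.

```rust
/// Hypothetical sketch: returns the index just past a UUID beginning at
/// `start`, or `start` itself when the bytes there do not form a UUID.
fn scan_uuid_sketch(bytes: &[u8], start: usize) -> usize {
    // Lengths of the five hex groups in the 8-4-4-4-12 layout.
    const GROUPS: [usize; 5] = [8, 4, 4, 4, 12];
    let mut i = start;
    for (n, &len) in GROUPS.iter().enumerate() {
        if n > 0 {
            // Every group after the first must be preceded by a dash.
            if bytes.get(i) != Some(&b'-') {
                return start;
            }
            i += 1;
        }
        for _ in 0..len {
            match bytes.get(i) {
                Some(b) if b.is_ascii_hexdigit() => i += 1,
                _ => return start, // non-hex byte or end of input
            }
        }
    }
    i
}
```

Returning `start` on failure lets a caller detect "no match" with a single comparison and fall back to scanning a number or word from the same position. This sketch does not check what follows the UUID; whether the real function does is not stated here.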