Module cql_lexer


Unified CQL lexer (tokenizer) with grammar-aware position tracking.

Provides a single shared tokenizer that powers syntax highlighting (colorizer), tab completion (completer), and statement parsing (parser). It replaces three ad-hoc implementations with one consistent interpretation of CQL.

Design: hand-written state machine, O(n) single pass, no dependencies. See docs/plans/18-cql-lexer.md for motivation and design decisions.
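To illustrate what a hand-written, single-pass tokenizer of this kind can look like, here is a minimal sketch. It is not the module's implementation: `SketchKind`, `SketchToken`, and `sketch_tokenize` are hypothetical names, the token classes are deliberately reduced, and grammar context is left out entirely. It only shows the O(n) shape, where every character is consumed exactly once.

```rust
// Hypothetical sketch: none of these names are the module's real API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchKind {
    Whitespace,
    Word,
    Number,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchToken {
    kind: SketchKind,
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
}

// Pick the state for a token from its first character.
fn classify(c: char) -> SketchKind {
    if c.is_whitespace() {
        SketchKind::Whitespace
    } else if c.is_alphabetic() || c == '_' {
        SketchKind::Word
    } else if c.is_ascii_digit() {
        SketchKind::Number
    } else {
        SketchKind::Other
    }
}

// Single left-to-right pass: each char is consumed once, so the cost is O(n).
fn sketch_tokenize(input: &str) -> Vec<SketchToken> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some((start, c)) = chars.next() {
        let kind = classify(c);
        let mut end = start + c.len_utf8();
        // Whitespace, words, and numbers extend over a run of characters;
        // anything else is emitted as a one-character token.
        if kind != SketchKind::Other {
            while let Some(&(i, next)) = chars.peek() {
                let continues = match kind {
                    SketchKind::Word => next.is_alphanumeric() || next == '_',
                    SketchKind::Number => next.is_ascii_digit(),
                    SketchKind::Whitespace => next.is_whitespace(),
                    SketchKind::Other => false,
                };
                if !continues {
                    break;
                }
                end = i + next.len_utf8();
                chars.next();
            }
        }
        tokens.push(SketchToken { kind, start, end });
    }
    tokens
}
```

Because the token spans are contiguous byte ranges, slicing the input with `&input[tok.start..tok.end]` recovers each token's text without any allocation.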

Structs

Token
A token produced by the CQL lexer.

Enums

GrammarContext
Grammar context: what syntactic position we're at, used to distinguish keywords from identifiers and to drive tab completion.
TokenKind
Classification of a CQL token.

Constants

COLUMN_LIST_KEYWORDS 🔒
Keywords that remain keywords inside a SELECT column list. These are clause-level keywords that terminate or modify the column list.
CQL_KEYWORDS 🔒
Set of CQL keywords and shell commands (uppercase, sorted for binary search).
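Keeping the keyword table uppercase and sorted is what makes a case-insensitive binary-search lookup possible. The sketch below shows the idea under stated assumptions: `SKETCH_KEYWORDS` is a tiny hypothetical stand-in for the real (private) table, and `sketch_is_keyword` is an illustrative helper, not the signature of `is_cql_keyword`.

```rust
// Hypothetical stand-in for the real table, which is larger.
// Entries must be uppercase and sorted, otherwise binary search is invalid.
const SKETCH_KEYWORDS: &[&str] = &["AND", "FROM", "INSERT", "SELECT", "UPDATE", "WHERE"];

// Case-insensitive lookup: normalise the word to uppercase, then binary-search.
fn sketch_is_keyword(word: &str) -> bool {
    let upper = word.to_ascii_uppercase();
    SKETCH_KEYWORDS.binary_search(&upper.as_str()).is_ok()
}
```

A lookup costs O(log k) comparisons for k keywords and needs no hashing or extra dependency, which fits the module's no-dependencies design note.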

Functions

advance_context_after_name 🔒
Advance context after a name (quoted identifier, etc.).
advance_context_after_value 🔒
Advance context after a value (string literal, number, etc.).
advance_context_after_word 🔒
Advance the grammar context after seeing a word token.
char_len_at 🔒
Get the UTF-8 byte length of the char starting at position i.
classify_word 🔒
Classify a word as keyword, boolean, or identifier based on grammar context.
context_from_tokens
Derive grammar context from a token sequence (skipping whitespace/comments).
grammar_context_at_end
Get the grammar context at the end of the given input. Useful for tab completion to know what kind of token is expected next.
has_keyword_before 🔒
Check if any of the given keywords appear earlier in the significant token list.
is_column_list_keyword 🔒
is_cql_keyword
Check if a word is a CQL keyword (case-insensitive).
is_number_sign_position 🔒
Determine if a '-' is in a position where it could be a negative number sign (after operator, punctuation, or at start).
is_operator_char 🔒
is_punctuation 🔒
is_strict_identifier_context 🔒
Contexts where the next word is always an identifier (a name), regardless of whether it matches a keyword. E.g., after FROM the next word is a table name.
is_two_char_operator 🔒
looks_like_uuid 🔒
Check if the text from start to num_end followed by '-' looks like the beginning of a UUID (8 hex digits).
scan_uuid 🔒
Scan a UUID pattern: 8-4-4-4-12 hex digits with dashes. Returns the end position if valid, or start if not.
significant_tokens
Extract only the significant (non-whitespace, non-comment) tokens.
strip_comments
Strip comments from CQL input, replacing block comments with a space and removing line comments (preserving newlines).
tokenize
Tokenize a CQL input string into a sequence of tokens with grammar context.
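
Two of the private helpers above have contracts precise enough to sketch. The code below is an illustration only: `sketch_char_len_at` and `sketch_scan_uuid` are hypothetical names, and their byte-slice signatures are assumptions, since the real signatures are not shown on this page. The UUID scanner follows the documented contract of returning the end position on success and `start` on failure.

```rust
// Hypothetical sketch of a char_len_at-style helper: derive the UTF-8
// sequence length from the leading byte at index `i`.
fn sketch_char_len_at(bytes: &[u8], i: usize) -> usize {
    match bytes[i] {
        b if b < 0x80 => 1,            // 0xxxxxxx: ASCII
        b if b >> 5 == 0b110 => 2,     // 110xxxxx
        b if b >> 4 == 0b1110 => 3,    // 1110xxxx
        b if b >> 3 == 0b11110 => 4,   // 11110xxx
        _ => 1, // continuation or invalid byte: step by one to keep moving
    }
}

// Hypothetical sketch of a scan_uuid-style helper: match the 8-4-4-4-12
// hex-digit pattern starting at `start`. Returns the end position if the
// whole pattern matches, or `start` if it does not.
// Assumption: this sketch does not inspect the character after the UUID.
fn sketch_scan_uuid(bytes: &[u8], start: usize) -> usize {
    const GROUPS: [usize; 5] = [8, 4, 4, 4, 12];
    let mut pos = start;
    for (n, &len) in GROUPS.iter().enumerate() {
        // Every group except the first must be preceded by a dash.
        if n > 0 {
            if bytes.get(pos) != Some(&b'-') {
                return start;
            }
            pos += 1;
        }
        for _ in 0..len {
            match bytes.get(pos) {
                Some(b) if b.is_ascii_hexdigit() => pos += 1,
                _ => return start,
            }
        }
    }
    pos
}
```

Returning `start` on failure lets a caller test for success with a single comparison (`end != start`) and fall back to number scanning without rewinding any state.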