Module cql_lexer

Unified CQL lexer (tokenizer) with grammar-aware position tracking.

Provides a single shared tokenizer that powers syntax highlighting (colorizer), tab completion (completer), and statement parsing (parser). It replaces three ad-hoc implementations with one consistent understanding of CQL.

Design: hand-written state machine, O(n) single pass, no dependencies. See docs/plans/18-cql-lexer.md for motivation and design decisions.
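To illustrate what a hand-written, single-pass state machine can look like, here is a minimal sketch. It is not the module's implementation: `SketchKind`, `SketchToken`, and `sketch_tokenize` are hypothetical names, the token set is deliberately tiny, and comments, operators, and grammar context are left out.

```rust
/// Hypothetical token kinds for this sketch; the real `TokenKind` has its own variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchKind {
    Whitespace,
    Word,
    Number,
    StringLiteral,
    Other,
}

/// Hypothetical token: a kind plus a byte range into the input.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchToken {
    kind: SketchKind,
    start: usize,
    end: usize,
}

/// Single left-to-right pass: the cursor `i` only ever moves forward,
/// so the work is linear in the input length.
fn sketch_tokenize(input: &str) -> Vec<SketchToken> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let b = bytes[i];
        let kind = if b.is_ascii_whitespace() {
            while i < bytes.len() && bytes[i].is_ascii_whitespace() {
                i += 1;
            }
            SketchKind::Whitespace
        } else if b.is_ascii_alphabetic() || b == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            SketchKind::Word
        } else if b.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            SketchKind::Number
        } else if b == b'\'' {
            // Single-quoted string; a doubled quote ('') is an escaped quote.
            i += 1;
            while i < bytes.len() {
                if bytes[i] == b'\'' {
                    if i + 1 < bytes.len() && bytes[i + 1] == b'\'' {
                        i += 2;
                        continue;
                    }
                    i += 1;
                    break;
                }
                i += 1;
            }
            // An unterminated string simply runs to the end of the input.
            SketchKind::StringLiteral
        } else {
            // Anything else: consume one whole UTF-8 character
            // (the lead byte plus any continuation bytes).
            i += 1;
            while i < bytes.len() && (bytes[i] & 0xC0) == 0x80 {
                i += 1;
            }
            SketchKind::Other
        };
        tokens.push(SketchToken { kind, start, end: i });
    }
    tokens
}
```

Because each token stores a byte range rather than a copy of the text, a caller can recover the lexeme with `&input[token.start..token.end]`.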

Structs§

Token: A token produced by the CQL lexer.

Enums§

GrammarContext: The syntactic position the lexer is currently at, used to distinguish keywords from identifiers and to drive tab completion.
TokenKind: Classification of a CQL token.

Constants§

COLUMN_LIST_KEYWORDS 🔒: Keywords that remain keywords inside a SELECT column list. These are clause-level keywords that terminate or modify the column list.
CQL_KEYWORDS 🔒: Set of CQL keywords and shell commands (uppercase, sorted for binary search).
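The "uppercase, sorted for binary search" layout suggests a lookup along the following lines. This is only a sketch under assumptions: `SKETCH_KEYWORDS` is a hypothetical, heavily shortened stand-in for `CQL_KEYWORDS`, and `sketch_is_keyword` is a hypothetical name that may not match how `is_cql_keyword` is actually written.

```rust
/// Hypothetical, shortened keyword table. Like the real constant is described,
/// it is uppercase and sorted so that binary search is valid.
const SKETCH_KEYWORDS: &[&str] = &[
    "AND", "FROM", "INSERT", "INTO", "SELECT", "UPDATE", "WHERE",
];

/// Case-insensitive keyword check: uppercase the candidate once,
/// then binary-search the sorted table.
fn sketch_is_keyword(word: &str) -> bool {
    let upper = word.to_ascii_uppercase();
    SKETCH_KEYWORDS
        .binary_search_by(|kw| (*kw).cmp(upper.as_str()))
        .is_ok()
}
```

Binary search only gives correct answers while the table stays sorted, so a table like this is usually paired with a unit test that checks the ordering.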

Functions§

advance_context_after_name 🔒: Advance context after a name (quoted identifier, etc.).
advance_context_after_value 🔒: Advance context after a value (string literal, number, etc.).
advance_context_after_word 🔒: Advance the grammar context after seeing a word token.
char_len_at 🔒: Get the UTF-8 byte length of the char starting at position i.
classify_word 🔒: Classify a word as keyword, boolean, or identifier based on grammar context.
context_from_tokens: Derive grammar context from a token sequence (skipping whitespace/comments). input_len is the total length of the original input, used to detect trailing whitespace.
find_keyword_before_table 🔒: Find the keyword immediately before a table reference at the end of the token stream. Handles unqualified (FROM table) and qualified (FROM ks.table) patterns.
grammar_context_at_end: Get the grammar context at the end of the given input. Useful for tab completion to know what kind of token is expected next.
has_keyword_before 🔒: Check if any of the given keywords appears earlier in the significant token list.
is_column_list_keyword 🔒
is_cql_keyword: Check if a word is a CQL keyword (case-insensitive).
is_number_sign_position 🔒: Determine if a ‘-’ is in a position where it could be a negative number sign (after operator, punctuation, or at start).
is_operator_char 🔒
is_punctuation 🔒
is_strict_identifier_context 🔒: Contexts where the next word is always an identifier (a name), regardless of whether it matches a keyword. E.g., after FROM the next word is a table name.
is_two_char_operator 🔒
looks_like_uuid 🔒: Check if the text from start to num_end followed by ‘-’ looks like the beginning of a UUID (8 hex digits).
scan_uuid 🔒: Scan a UUID pattern: 8-4-4-4-12 hex digits with dashes. Returns the end position if valid, or start if not.
significant_tokens: Extract only the significant (non-whitespace, non-comment) tokens.
strip_comments: Strip comments from CQL input, replacing block comments with a space and removing line comments (preserving newlines).
tokenize: Tokenize a CQL input string into a sequence of tokens with grammar context.
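As an illustration of the contract described for scan_uuid (match the 8-4-4-4-12 hex pattern, return the end position on success or start on failure), here is a minimal sketch. The signature and the name `sketch_scan_uuid` are assumptions; the private function in this module may take different parameters.

```rust
/// Hypothetical signature: byte offset in, byte offset out.
/// Returns the offset just past the UUID if `input[start..]` begins with
/// 8-4-4-4-12 hex digits separated by dashes, otherwise returns `start`.
/// This sketch does not inspect what follows the UUID.
fn sketch_scan_uuid(input: &str, start: usize) -> usize {
    const GROUPS: [usize; 5] = [8, 4, 4, 4, 12];
    let bytes = input.as_bytes();
    let mut i = start;
    for (group_index, &len) in GROUPS.iter().enumerate() {
        // Every group after the first must be preceded by a dash.
        if group_index > 0 {
            if bytes.get(i) != Some(&b'-') {
                return start;
            }
            i += 1;
        }
        // Each group must consist of exactly `len` hex digits.
        for _ in 0..len {
            match bytes.get(i) {
                Some(b) if b.is_ascii_hexdigit() => i += 1,
                _ => return start,
            }
        }
    }
    i
}
```

Returning start on failure lets the caller test for success with a single comparison (`end != start`) and fall back to ordinary number scanning without any rewinding.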
