PyLex Analyzer
A small lexer that reads C-style source, tokenizes it with PLY, and prints each lexeme with a stable numeric group for quick inspection.
Snapshot
| Field | Details |
|---|---|
| Type | Command-line lexical analyzer |
| Context | Academic prototype |
| Role | Solo developer |
| Year | 2022 |
| Status | Completed prototype |
| Main focus | Token rules, C-like keyword handling, grouped token output |
Overview
PyLex Analyzer is a Python tool I built to practice the first stage of compilation: turning source text into a stream of typed tokens. It targets C-like syntax (delimiters, operators, identifiers, keywords, literals, and comments) and prints each token in a fixed-width table so I could see what the lexer was producing while testing sample programs.
The project is a learning prototype, not a full compiler. Parsing and code generation are out of scope. I keep it in my portfolio because it shows how I modeled token classes, integrated a mature lexer library, and built a simple CLI around repeatable examples.
The problem
Before building a parser or interpreter, I needed a reliable way to break source code into meaningful pieces and verify that edge cases (keywords vs identifiers, floats vs integers, comments) were handled consistently.
- Raw text is hard to debug without a structured token view
- Hand-rolling a lexer from scratch is error-prone for operator and string patterns
- Course-style C samples needed a repeatable way to run the same rules against different files
- I wanted numeric groups so related lexeme types could be summarized at a glance
Who it was for
- Me, while learning compiler front-end concepts
- Anyone experimenting with lexing C-like snippets in Python
- Students or developers who want a minimal, readable token dump from sample .cpp files
My role
I designed and implemented the full flow end to end: token rules, keyword lists, grouping logic, file selection, console input mode, example programs, and formatted output. I chose PLY for the lexer engine and wrote the project-specific rules and helpers on top of it.
What the project does
The tool loads C-like source (from bundled examples or pasted input), runs it through a PLY-backed lexer, and prints each token with a group id, internal type name, and value. It also prints a reference table that maps group numbers to lexeme categories.
- Tokenizes delimiters, operators, identifiers, keywords, strings, chars, integers, and reals
- Discards or skips comments, preprocessor-style lines, and whitespace
- Promotes known words from identifiers to keywords using a fixed keyword set
- Maps each PLY token type to a numeric group for summary output
- Offers interactive selection among sample programs or direct stdin input
Key features
C-like token rules
Regular expressions and small token functions cover the constructs I cared about for classroom-style C++ samples, including compound operators and both block and line comments.
```python
def t_IDENTIFIER(t):
    r'[a-zA-Z_]+[a-zA-Z0-9_]*'
    if t.value in keyword:
        t.type = 'KEYWORD'
    return t
```

I kept keyword detection inside the identifier rule so reserved words never show up as generic identifiers in the output.
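The promotion step can be tried in isolation with a minimal sketch that uses only the standard library `re` module instead of PLY. The keyword set below is hypothetical; the project's own list lives in constants.py and may differ.

```python
import re

# Hypothetical keyword set for illustration only.
KEYWORDS = {"int", "float", "return", "if", "else", "while"}

# Same identifier pattern as the PLY rule above.
IDENTIFIER = re.compile(r"[a-zA-Z_]+[a-zA-Z0-9_]*")


def classify_words(source):
    """Return (type, value) pairs for every identifier-shaped word in source."""
    pairs = []
    for match in IDENTIFIER.finditer(source):
        word = match.group()
        # Promote the word only when the whole lexeme is a keyword,
        # so "integer" stays an identifier even though it starts with "int".
        kind = "KEYWORD" if word in KEYWORDS else "IDENTIFIER"
        pairs.append((kind, word))
    return pairs


if __name__ == "__main__":
    for kind, word in classify_words("int integer = returned;"):
        print(kind, word)
```

Because the lookup happens after the full identifier has matched, a keyword that is only a prefix of a longer name is never promoted by mistake.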
Numeric token groups
Each lexeme category maps to a stable group number used in the printed stream and summary table.
```python
def token_group(tok):
    group = 0
    if tok == tokens[0]:  # DELIMITER
        group = group_number[0]
    elif tok == tokens[1]:  # OPERATOR
        group = group_number[1]
    # ...
    return group
```

The grouping layer sits above PLY so the console view stays compact even when the underlying token type names are verbose.
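One possible alternative to the if/elif chain is a dictionary built once from the two parallel lists. This is a sketch, not the project's code: the token names and group ids below are assumptions standing in for the values in constants.py.

```python
# Hypothetical token names and group ids; the real values live in constants.py.
TOKEN_NAMES = ["DELIMITER", "OPERATOR", "IDENTIFIER", "KEYWORD",
               "STRING", "CHAR", "INTEGER", "REAL"]
GROUP_NUMBERS = [1, 2, 3, 4, 5, 6, 7, 8]

# Build the lookup once instead of walking a chain of comparisons per token.
GROUP_BY_TYPE = dict(zip(TOKEN_NAMES, GROUP_NUMBERS))


def token_group(tok_type):
    """Return the group id for a token type, or 0 when the type is unknown."""
    return GROUP_BY_TYPE.get(tok_type, 0)


if __name__ == "__main__":
    print(token_group("OPERATOR"))
    print(token_group("NOT_A_TOKEN"))
```

The fallback of 0 mirrors the default `group = 0` in the original function, so unknown types behave the same way.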
Example-driven and interactive modes
main.py lists files under test/examples/ and tokenizes the chosen program. console.py accepts pasted source for quick one-off checks without creating a file.
Readable token table output
The main loop formats token group, type, and value in aligned columns and appends a list of group ids for the whole file, which made regression checks on sample programs straightforward.
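A manual regression check on the group id list could be automated with a small comparison helper. The sketch below is an assumption about how such a check might look; the function name and sample lists are hypothetical and not part of the repository.

```python
def first_mismatch(actual, expected):
    """Return the index of the first differing group id, or None if the lists match."""
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    # One list is a strict prefix of the other: report where the shorter one ends.
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


if __name__ == "__main__":
    # Hypothetical group ids from a previous known-good run and a new run.
    known_good = [4, 3, 1, 1, 1]
    new_run = [4, 3, 1, 2, 1]
    print(first_mismatch(new_run, known_good))
```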
Technical approach
The architecture is intentionally flat: tokrules.py defines MyLexer() and returns a PLY lexer instance; main.py and console.py feed input and print results; constants.py holds keywords, token names, and group numbers; helpers.py handles file picking and grouping.
Scanning is powered by PLY (Python Lex-Yacc) by David M. Beazley (Dabeaz LLC). The project includes the vendored ply/ package (lex.py and yacc.py) under PLY’s license terms. I use ply.lex to build and run the lexer; ply.yacc is present as part of PLY but is not integrated into this prototype’s pipeline yet. PLY brings the classic lex/yacc workflow to Python; my custom work is the token rules, keyword table, grouping scheme, and CLI.
```python
import ply.lex as lex  # vendored ply/ package


def MyLexer():
    # ... token rules ...
    return lex.lex()
```

I wrapped rule definitions in a factory function so PLY builds the lexer from a clean local namespace each time.
```python
while True:
    tok = lexer.token()
    if not tok:
        break
    tok_group = token_group(tok.type)
    data_tokens.append(tok_group)
    print('{:>5} | {:<10} | {:<64}'.format(tok_group, tok.type, tok.value))
```

The consumer loop stays dumb on purpose: all classification complexity lives in tokrules.py and helpers.py.
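To see what the format string does to column widths, the sketch below applies it to a few hand-written rows. The header row and the sample values are assumptions for illustration; whether the tool prints a header is not stated here.

```python
# Same format string as the consumer loop: right-aligned group,
# left-aligned type (10 wide) and value (64 wide).
ROW = '{:>5} | {:<10} | {:<64}'


def render_table(rows):
    """Format (group, type, value) rows, with trailing padding stripped."""
    lines = [ROW.format('GROUP', 'TYPE', 'VALUE')]  # hypothetical header row
    lines += [ROW.format(group, tok_type, value) for group, tok_type, value in rows]
    return [line.rstrip() for line in lines]


if __name__ == "__main__":
    sample = [(4, 'KEYWORD', 'int'), (3, 'IDENTIFIER', 'main')]
    print('\n'.join(render_table(sample)))
```

Note that the width specifiers pad but never truncate, so a value longer than 64 characters simply pushes past the column.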
Error handling for illegal characters prints the offending character and skips one position, which keeps exploratory runs going without crashing on a single bad symbol.
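In PLY this recovery conventionally lives in a `t_error(t)` rule that reports `t.value[0]` and calls `t.lexer.skip(1)`. The same report-and-skip idea can be sketched without PLY using the standard library; the three-rule token set below is hypothetical and far smaller than the project's.

```python
import re

# Hypothetical, deliberately tiny rule set for illustration.
TOKEN = re.compile(
    r"(?P<INTEGER>[0-9]+)"
    r"|(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z0-9_]*)"
    r"|(?P<OPERATOR>[+\-*/=])"
)


def scan(source):
    """Return (tokens, illegal_chars); an illegal character is skipped, not fatal."""
    tokens, illegal, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(source, pos)
        if match is None:
            illegal.append(source[pos])  # record the offending character
            pos += 1                     # skip exactly one position and carry on
            continue
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, illegal


if __name__ == "__main__":
    print(scan("x = 4 @ y"))
```

Skipping a single character keeps the scanner aligned with the input, so the tokens after the bad symbol are still reported.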
Design decisions
I optimized for clarity and inspection speed over completeness. The CLI prompts are in Spanish because that matched how I was working with the examples at the time.
- Used PLY instead of a from-scratch lexer generator to focus on token design and output, not scanner tables
- Collapsed many lexeme kinds into eight groups with numeric ids for easier scanning of long token lists
- Bundled small C++ samples (hello world, division, variable sizing) as predictable fixtures
- Separated console.py from file-based main.py so pasted snippets did not require temp files
- Left parsing (yacc/grammar) out of scope so the prototype stayed focused on lexical analysis only
Challenges and tradeoffs
- Operator and delimiter regexes are dense; overlapping patterns required careful ordering in PLY rule definitions
- Treating printf and main as keywords is useful for demos but does not match the standard C++ keyword list
- The vendored PLY tree adds weight; the tradeoff was zero pip dependency friction and offline use
- No automated test suite in the repo; validation was manual via example files and console runs
- Without a parser pass, the tool cannot validate syntax, only surface-level tokenization
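The ordering problem with overlapping operator patterns can be reproduced with plain regex alternation, which tries alternatives left to right. The sketch below uses the standard library rather than PLY, and the three-operator patterns are hypothetical. In PLY itself, rules defined as functions are tried in definition order, while rules defined as strings are sorted by decreasing regex length.

```python
import re

# Shorter alternative first: "<" wins before "<=" is ever tried.
WRONG_ORDER = re.compile(r"<|<=|=")

# Longer alternative first: "<=" is matched as one operator.
RIGHT_ORDER = re.compile(r"<=|<|=")


def split_operators(pattern, text):
    """Return the operator lexemes that the pattern finds in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(split_operators(WRONG_ORDER, "<="))
    print(split_operators(RIGHT_ORDER, "<="))
```

The same reasoning applies to reals versus integers: a real-number rule has to be tried before an integer rule, or the integer part is consumed first.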
What I learned
Building PyLex Analyzer made the lexer stage tangible: I saw how regex rules, keyword tables, and library-generated scanners fit together before any grammar work.
- Regular expressions need discipline; small mistakes show up immediately in token streams
- Separating library lexing from project-specific grouping keeps output policies flexible
- Vendoring a well-known tool like PLY is a practical way to learn without reimplementing lex tables
- Giving credit to upstream authors (David M. Beazley for PLY) is part of responsible use of embedded libraries
Current status
This project is no longer maintained and should be read as a completed learning prototype from 2022. I keep it in my portfolio because it documents early work on compiler front ends, third-party library integration, and CLI tooling around token inspection.
If I revisited this today
- Add a minimal test harness that asserts token sequences for each example file
- Wire up ply.yacc with a small grammar only if the goal expands beyond lexing
- Align the keyword list with a real C/C++ standard or generate it from a spec
- Support reading arbitrary file paths from the command line, not only bundled examples
- Emit JSON or LSP-friendly output for tooling instead of only formatted console tables
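As a rough idea of what the JSON output item might look like, the sketch below serializes token rows with the standard library. The function name and the field names (`group`, `type`, `value`) are assumptions, not an existing interface of the tool.

```python
import json


def tokens_to_json(rows):
    """Serialize (group, type, value) rows as a JSON array of objects."""
    records = [
        {"group": group, "type": tok_type, "value": value}
        for group, tok_type, value in rows
    ]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    # Hypothetical rows, shaped like the columns of the console table.
    print(tokens_to_json([(4, "KEYWORD", "int"), (3, "IDENTIFIER", "main")]))
```

Keeping the same three fields as the console table would let one token stream feed both the human-readable view and any downstream tooling.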