ToyC语言 本项目用于学习编译原理。
将参照龙书版本的《编译原理》,
以及LLVM的编译器制作教程:https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl01.html
制作一个完整的编译器前端。
编译器的结构
图1:摘自《Compilers Principles, Techniques & Tools》第二版Figure 2.3
UML
图2:UML
词法分析器(Lexer) 设计Token Token主要分为: 多字符保留字、标识符、数字以及其余单个字符。
多字符保留字:
控制流语句:if, else, do, while, break
布尔运算:true, false, &&, ||
比较运算:>=, <=, ==, !=
变量类型:int, float, bool, char
标识符:
正则表达式:[_a-zA-Z][_a-zA-Z0-9]*
数字:
整型正则表达式:[0-9]+
浮点型正则表达式:[0-9]+.[0-9]*
其余单个字字符
识别算法
测试 输入:字符串 输出:按序输出token流,每个token占一行
词法分析器的步骤可用如下伪代码表示:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Input: Output: cache = ' ' Scan() { do { if cache != 空格、换行、制表符 break ; }while (read(cache)); t = readKey(cache) t = readNumber(cache); t = return t; }
语法分析器(Parser) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 PROGRAM -> BLOCK BLOCK -> '{' DECLS STMTS '}' DECLS -> DECLS DECL -> eps DECL -> TYPE id';' TYPE -> basic DIMS DIMS -> '[' DIMS']' -> eps STMTS -> STMTS STMT -> eps STMT -> ASSIGN';' -> if ( BOOL ) STMT -> if ( BOOL ) STMT else STMT -> while ( BOOL ) STMT -> do STMT while ( BOOL ) -> break ';' -> BLOCK ASSIGN -> id OFFSET = BOOL OFFSET -> [ BOOL ] OFFSET -> eps BOOL -> BOOL "||" JOIN -> JOIN JOIN -> JOIN "&&" EQAULITY -> EQUALITY EQUALITY-> EQUALITY "==" CMP -> EQUALITY "!=" CMP -> CMP CMP -> EXPR < EXPR -> EXPR <= EXPR -> EXPR >= EXPR -> EXPR > EXPR -> EXPR EXPR -> EXPR + TERM -> EXPR - TERM -> TERM TERM -> TERM * UNARY -> TERM / UNARY -> UNARY UNARY -> '!' UNARY -> '-' UNARY -> FACTOR FACTOR -> ( BOOL ) -> id OFFSET -> number -> real -> true -> false
符号表(Symbol Table)
图2:符号表示意图
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 PROGRAM -> {top = null;} BLOCK BLOCK -> '{' { saved = top; top = new Scope (top); } DECLS DECL { top = saved; } '}' DECLS -> DECLS DECL -> eps DECL -> TYPE id';' { s = new Symbol (id); s.type = TYPE.lexeme; top.put (id.lexeme, s); } TYPE -> basic {DIMS.type = basic; } DIMS {TYPE.lexeme = DIMS.type; } DIMS -> '[' num']' DIMS { Array.sz = num * Array.sz; DIMS.type = Array; } -> eps { Array.sz = 1 ; Array.type = Dims.type; } STMTS -> STMTS STMT -> eps STMT -> BLOCK STMT -> .... > ... id { s = top.get (id.lexeme); } ....
单元测试:
1 2 3 4 5 6 7 8 9 10 11 12 13 PROGRAM -> BLOCK BLOCK -> '{' DECLS STMTS '}' DECLS -> DECLS DECL -> eps DECL -> TYPE id ';' TYPE -> basic DIMS DIMS -> '[' num']' DIMS -> eps STMTS -> STMTS STMT -> eps STMT -> BLOCK -> FACTOR ';' FACTOR -> id
表达式的计算
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 PROGRAM -> STMTS STMTS -> STMTS STMT -> eps STMT -> ASSIGN';' ASSIGN -> id OFFSET = BOOL OFFSET -> [ BOOL ] OFFSET -> eps BOOL -> BOOL "||" JOIN -> JOIN JOIN -> JOIN "&&" EQAULITY -> EQUALITY EQUALITY-> EQUALITY "==" CMP -> EQUALITY "!=" CMP -> CMP CMP -> EXPR < EXPR -> EXPR <= EXPR -> EXPR >= EXPR -> EXPR > EXPR EXPR -> EXPR + TERM -> EXPR - TERM -> TERM TERM -> TERM * UNARY -> TERM / UNARY -> UNARY UNARY -> '!' UNARY -> '-' UNARY -> FACTOR FACTOR -> ( BOOL ) -> OFFSET -> number -> real -> true -> false -> id OFFSET
输入
1 2 3 4 5 6 7 8 9 a = b + 1; c = a + 32 * 43 - b / ( g - 2); d = e + f; c = -e[3]; c[33+34+sd3*c] = de*c + c2; f[c[2*d]+4] = df + de[23-s[f]]; s[m][n][o] = -d[3][x]; a = !a || s && (c || d) || !f && kk ; b = (a + c) > (b * 2 - 1) || a < b && c;
输出
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 a = b + 1 L3: t1 = 32 * 43 t2 = a + t1 t3 = g - 2 t4 = b / t3 c = t2 - t4 L4: d = e + f L5: t5 = 3 * 10 t6 = e[t5] c = -t6 L6: t7 = 33 + 34 t8 = sd3 * c t9 = t7 + t8 t10 = t9 * 10 t11 = c[t10] t12 = de * c t13 = t12 + c2 c[t11] = t13 L7: t14 = 2 * d t15 = t14 * 10 t16 = c[t15] t17 = t16 + 4 t18 = t17 * 10 t19 = f[t18] t20 = f * 10 t21 = s[t20] t22 = 23 - t21 t23 = t22 * 10 t24 = de[t23] t25 = df + t24 f[t19] = t25 L8: t26 = m * 10 t27 = n * 10 t28 = t26 + t27 t29 = o * 10 t30 = t28 + t29 t31 = s[t30] t32 = 3 * 10 t33 = x * 10 t34 = t32 + t33 t35 = d[t34] t36 = -t35 s[t31] = t36 L9: if False a goto L13 t37 = 23 * 10 t38 = f * 10 t39 = t37 + t38 t40 = sp[t39] if False t40 goto L14 if c goto L13 if d goto L13 goto L14 L14: if f goto L11 if kk goto L13 goto L11 L13: t41 = true goto L12 L11: t41 = false L12: a = t41 L10: t42 = a + c t43 = b * 2 t44 = t43 - 1 if t42 > t44 goto L17 if False a < b goto L15 if c goto L17 goto L15 L17: t45 = true goto L16 L15: t45 = false L16: b = t45 L2:
控制流语句的中间代码 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 PROGRAM -> BLOCK BLOCK -> '{' STMTS '}' STMTS -> STMTS STMT -> eps STMT -> ASSIGN';' -> if ( BOOL ) STMT -> if ( BOOL ) STMT else STMT -> while ( BOOL ) STMT -> do STMT while ( BOOL ) -> break ';' -> BLOCK ASSIGN -> id OFFSET = BOOL OFFSET -> [ BOOL ] -> eps BOOL -> BOOL "||" JOIN -> JOIN JOIN -> JOIN "&&" EQAULITY -> EQUALITY EQUALITY-> EQUALITY "==" CMP -> EQUALITY "!=" CMP -> CMP CMP -> EXPR < EXPR -> EXPR <= EXPR -> EXPR >= EXPR -> EXPR > EXPR
输入
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 { if ((a + c) > (b * 2 - 1) || a < b && c){ a = b * 2 - c + (b + 2 * d); } while (a > 3){ a = a + 1; do a = a + 3; while( b > 2); if (c - 3){ b = c + 5; d = a + 3; } else { c = 3; break; } } while (c < b){ a = 3; } }
输出
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 L1: t1 = a + c t2 = b * 2 t3 = t2 - 1 if t1 > t3 goto L5 if False a < b goto L3 if c goto L5 goto L3L5: L4: t4 = b * 2 t5 = t4 - c t6 = 2 * d t7 = b + t6 a = t5 + t7L3: if False a > 3 goto L6L7: a = a + 1 L8: a = a + 3 L10: if b > 2 goto L8 goto L9L9: if False c - 3 goto L12L11: b = c + 5 L13: d = a + 3 goto L3L12: c = 3 L14: goto L6 goto L3L6: if False c < b goto L2L15: a = 3 goto L6L2:
总结 附录