# Roast (Regal's Optimized AST)

Roast is an optimized JSON format for [Rego](https://www.openpolicyagent.org/docs/policy-language) ASTs, as well as common utilities for working with both the Roast format and OPA's AST APIs.

Roast is used by [Regal](https://www.openpolicyagent.org/projects/regal), where the JSON representation of Rego's AST is used as input for static analysis [performed by Rego itself](https://web.archive.org/web/https://www.styra.com/blog/linting-rego-with-rego/) to determine whether policies conform to Regal's linter rules.

> [!IMPORTANT]
> This library changes frequently and provides no guarantees for what its API looks like. Depending on this library directly is not recommended! If you find any code here useful, feel free to use it as your own in your projects, but be aware that it may change at any time, and you may be better off copy-pasting whatever code you need.

## Goals

*   Fast to traverse and process in Rego
*   Usable without having to deal with quirks and inconsistencies
*   As easy to read as the original AST JSON format

While this module provides a way to encode an `ast.Module` to an optimized JSON format, it does not provide a decoder. In other words, there's currently no way to turn optimized AST JSON back into an `ast.Module` (or other AST types). While this would be possible to do, there's no real need for that given our current use-case for this format, which is to help work with the AST efficiently in Rego. Roast should not be considered a general purpose format for serializing the Rego AST.

## Differences

The following section outlines the differences between the original AST JSON format and the Roast format.

### Compact `location` format

Perhaps the most visually apparent change to the AST JSON format is how `location` attributes are represented. These attributes are **everywhere** in the AST, so optimizing these for fast traversal has a huge impact on both the size of the format and the speed at which it can be processed.

In the original AST, a `location` is represented as an object:

```
{
  "file": "p.rego",
  "row": 5,
  "col": 1,
  "text": "Y29sbGVjdGlvbg=="
}
```

And in the optimized format as a string:

```
"5:1:5:11"
```

The first two numbers, `row` and `col`, are present in both formats. In the optimized format, the third and fourth numbers are the end location, determined by the length of the decoded `text` attribute. In this case `Y29sbGVjdGlvbg==` decodes to `collection`, which is 10 characters long. The end location is therefore `5:11`. The text can later be retrieved when needed using the original source document as a lookup table of sorts.
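The conversion described above can be sketched in Go. This is a minimal illustration and not Roast's actual encoder: `compactLocation` is a hypothetical helper, lengths are counted in bytes, and the handling of text that spans several lines (end column counted from the start of the last line) is an assumption.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// compactLocation turns an original-format location (row, col and
// base64-encoded text) into a "row:col:endRow:endCol" string.
// Hypothetical helper for illustration, not part of Roast's API.
func compactLocation(row, col int, text string) (string, error) {
	decoded, err := base64.StdEncoding.DecodeString(text)
	if err != nil {
		return "", fmt.Errorf("decoding location text: %w", err)
	}

	lines := strings.Split(string(decoded), "\n")
	last := lines[len(lines)-1]

	endRow := row + len(lines) - 1
	// Single-line text ends at the start column plus its length (in bytes).
	endCol := col + len(last)
	if len(lines) > 1 {
		// Assumption: for multi-line text, the end column is counted
		// from column 1 of the last line.
		endCol = len(last) + 1
	}

	return fmt.Sprintf("%d:%d:%d:%d", row, col, endRow, endCol), nil
}
```

Calling `compactLocation(5, 1, "Y29sbGVjdGlvbg==")` yields the `"5:1:5:11"` string from the example above.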

While this may come with a small cost when the `location` is actually needed, it's a huge win when it's not. Having to `split` the result and parse the row and column values when needed incurs some overhead, but only a small percentage of `location` attributes are commonly used in practice.

Note that the `file` attribute is omitted entirely in the optimized format, as this would otherwise have to be repeated for each `location` value. This can easily be retrieved by other means.
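For consumers outside of Rego, the split-and-parse step mentioned above might look like the following Go sketch. The `Location` type and `parseLocation` function are hypothetical names used for illustration only, and are not something this library is known to export.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Location holds the four numbers of a compact location string.
// Hypothetical type for illustration, not part of Roast's API.
type Location struct {
	Row, Col, EndRow, EndCol int
}

// parseLocation splits "row:col:endRow:endCol" on ":" and converts
// each part to an integer.
func parseLocation(s string) (Location, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 4 {
		return Location{}, fmt.Errorf("expected 4 parts in %q, got %d", s, len(parts))
	}

	var nums [4]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return Location{}, fmt.Errorf("invalid number %q in %q: %w", p, s, err)
		}
		nums[i] = n
	}

	return Location{Row: nums[0], Col: nums[1], EndRow: nums[2], EndCol: nums[3]}, nil
}
```

Since the parsing only happens on demand, nodes whose location is never inspected cost nothing beyond carrying a short string.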

### "Empty" rule and `else` bodies

Rego rules don't necessarily have a body, or at least not one that's printed. Examples of this include:

```
package policy

default rule := "value"

map["key"] := "value"

collection contains "value"
```

OPA represents such rules internally (that is, in the AST) as having a body with a single expression containing the boolean value `true`. This creates a uniform way to represent rules, so a rule like:

```
collection contains "value"
```

Would in the AST be identical to:

```
collection contains "value" if {
    true
}
```

And in the OPA JSON AST format:

```
{
  "body": [
    {
      "index": 0,
      "location": {
        "file": "p.rego",
        "row": 5,
        "col": 1,
        "text": "Y29sbGVjdGlvbg=="
      },
      "terms": {
        "location": {
          "file": "p.rego",
          "row": 5,
          "col": 1,
          "text": "Y29sbGVjdGlvbg=="
        },
        "type": "boolean",
        "value": true
      }
    }
  ],
  "head": {
    "name": "collection",
    "key": {
      "location": {
        "file": "p.rego",
        "row": 5,
        "col": 21,
        "text": "InZhbHVlIg=="
      },
      "type": "string",
      "value": "value"
    },
    "ref": [
      {
        "location": {
          "file": "p.rego",
          "row": 5,
          "col": 1,
          "text": "Y29sbGVjdGlvbg=="
        },
        "type": "var",
        "value": "collection"
      }
    ],
    "location": {
      "file": "p.rego",
      "row": 5,
      "col": 1,
      "text": "Y29sbGVjdGlvbiBjb250YWlucyAidmFsdWUi"
    }
  },
  "location": {
    "file": "p.rego",
    "row": 5,
    "col": 1,
    "text": "Y29sbGVjdGlvbg=="
  }
}
```

Notice how there are 20 lines of JSON just to represent the body, even though there isn't really one!

The optimized Rego AST format discards generated bodies entirely, and the same rule would be represented as:

```
{
  "head": {
    "location": "5:1:5:11",
    "ref": [
      {
        "location": "5:1:5:11",
        "type": "var",
        "value": "collection"
      }
    ],
    "key": {
      "type": "string",
      "value": "value",
      "location": "5:21:5:27"
    }
  },
  "location": "5:1:5:11"
}
```

Note that this applies equally to empty `else` bodies, which are represented the same way in the original AST, and omitted entirely in the optimized format.
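To illustrate what "generated" means in terms of the original JSON, here is a rough Go sketch that inspects a rule in OPA's JSON AST form. It relies on a heuristic that is an assumption on our part, based on the example above: a generated body is a single expression whose only term is the boolean `true`, located at the same row and column as the rule itself. Roast's encoder may well detect generated bodies differently, and `hasGeneratedBody` and the types below are hypothetical.

```go
package main

import "encoding/json"

// Minimal, hypothetical views of the original JSON AST. Only the
// attributes needed for the check are declared.
type position struct {
	Row int `json:"row"`
	Col int `json:"col"`
}

type term struct {
	Type  string `json:"type"`
	Value any    `json:"value"`
}

type expr struct {
	Location position        `json:"location"`
	Terms    json.RawMessage `json:"terms"`
}

type rule struct {
	Location position `json:"location"`
	Body     []expr   `json:"body"`
}

// hasGeneratedBody reports whether the rule's body looks generated:
// a single `true` term at the rule's own row and column (heuristic).
func hasGeneratedBody(ruleJSON []byte) (bool, error) {
	var r rule
	if err := json.Unmarshal(ruleJSON, &r); err != nil {
		return false, err
	}
	if len(r.Body) != 1 {
		return false, nil
	}

	var t term
	if err := json.Unmarshal(r.Body[0].Terms, &t); err != nil {
		// "terms" is an array (a call or operator expression) or is
		// missing, so the body is not a lone `true`.
		return false, nil
	}

	return t.Type == "boolean" && t.Value == true && r.Body[0].Location == r.Location, nil
}
```

An explicitly written `if { true }` body would be located at the `true` token rather than at the start of the rule, which is why the location comparison is part of the check.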

Similarly, Roast discards `location` attributes from nodes that don't have an actual location in the source code. An example of this is the `data` term of a package path, which is present only in the AST.

### Removed `annotations` attribute from module

OPA already attaches `annotations` to rules. Since the Roast format also attaches `package`- and `subpackages`-scoped `annotations` to the `package`, there is no need to store `annotations` at the module level, as that would effectively just duplicate data. Removing it can save a considerable amount of space in well-documented policies, as they should be!

### Removed `index` attribute from body expressions

In the original AST, each expression in a body carries a numeric `index` attribute. While this doesn't take much space, it is largely redundant, as the same number can be inferred from the order of the expressions in the body array. It's therefore been removed from the Roast format.

### Removed `name` attribute from rule heads

The `name` attribute found in the OPA AST for `rules` is unreliable, as it's not always present. The `ref` attribute, however, always is. While keeping `name` wouldn't come with any real cost in terms of AST size or performance, consistency is key.

### Added `interpolated` boolean attribute to template string expression nodes

Expressions found in template strings (like `$"{upper(input.name)}"`) are normal expressions as far as OPA is concerned. While that may be true, some linter rules targeted at "normal" expressions aren't applicable to expressions found in template strings. Take for example the [unassigned-return-value](https://www.openpolicyagent.org/projects/regal/rules/bugs/unassigned-return-value) rule, which would normally consider an expression like `upper(input.name)` to be a bug, as the return value of the `upper` call is never assigned. This is not the case for expressions found in template strings, as the output of their expressions is used to build the interpolated string. In order to avoid traversing the module tree twice (once for regular expressions, and once for expressions found in template strings), the Roast format instead marks expressions in template strings with an `interpolated` boolean attribute set to `true`. This attribute is only present for expressions found in template strings, and is thus either `true` or missing, never `false`.

### Template string `multi_line` attribute only present when `true`

Following the Roast convention of omitting boolean attributes that are `false`, the `multi_line` attribute found on template string nodes is only present when `true`.
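The "present only when `true`" convention maps naturally onto `omitempty` in Go's standard `encoding/json` package, as the sketch below shows. The struct types are hypothetical and heavily simplified, and Roast's own encoder may be implemented quite differently.

```go
package main

import "encoding/json"

// exprNode and templateStringNode are hypothetical, simplified node
// types used only to demonstrate the convention.
type exprNode struct {
	Location     string `json:"location"`
	Interpolated bool   `json:"interpolated,omitempty"`
}

type templateStringNode struct {
	Location  string `json:"location"`
	MultiLine bool   `json:"multi_line,omitempty"`
}

// encodeNode marshals a node to JSON. Boolean fields tagged with
// omitempty are left out of the output when they are false.
func encodeNode(node any) (string, error) {
	b, err := json.Marshal(node)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With these types, an `exprNode` whose `Interpolated` field is `false` encodes without any `interpolated` key, so a consumer only ever sees `true` or a missing attribute.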

### Module comments represented as arrays of location strings

Since a comment is simply text at a given location, comments are represented as location strings. OPA's AST format adds a redundant extra text attribute only to strip off the leading `#` from the comment text, and is additionally inconsistent with the rest of its format, as both the `text` and `location` attributes are title-cased.
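Because only the location is stored, the comment text has to be looked up in the original source when needed. The following Go sketch shows one way to do that. `textAt` is a hypothetical helper, and it assumes 1-based rows and columns, an exclusive end column, columns counted in bytes, and a span that stays on a single line (which holds for Rego comments).

```go
package main

import (
	"fmt"
	"strings"
)

// textAt returns the part of source covered by a compact location
// string of the form "row:col:endRow:endCol".
// Hypothetical helper for illustration, not part of Roast's API.
func textAt(source, location string) (string, error) {
	var row, col, endRow, endCol int
	if _, err := fmt.Sscanf(location, "%d:%d:%d:%d", &row, &col, &endRow, &endCol); err != nil {
		return "", fmt.Errorf("parsing location %q: %w", location, err)
	}

	lines := strings.Split(source, "\n")
	if row != endRow || row < 1 || row > len(lines) {
		return "", fmt.Errorf("location %q is not a single line within the source", location)
	}

	line := lines[row-1]
	if col < 1 || endCol < col || endCol-1 > len(line) {
		return "", fmt.Errorf("location %q is out of range", location)
	}

	// Columns are 1-based and the end column is exclusive.
	return line[col-1 : endCol-1], nil
}
```

The returned text includes the leading `#`. A caller that wants the bare comment text can strip it with `strings.TrimPrefix`.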

## Performance

While the numbers may vary some, the Roast format is currently about 40-50% smaller in size than the original AST JSON format, and can be processed (in Rego, using `walk` and so on) about 1.25 times faster.