diff --git a/docs/semgrep-code/editor.md b/docs/semgrep-code/editor.md index a59006022b..4bc0006b1d 100644 --- a/docs/semgrep-code/editor.md +++ b/docs/semgrep-code/editor.md @@ -152,8 +152,8 @@ To write a rule in advanced mode: - **Semgrep Assistant**: use Semgrep Assistant to [generate custom rules](/semgrep-assistant/customize#write-custom-rules-beta) - **Metavariable-comparison**: demonstrates how to use [the `metavariable-comparison` key](/writing-rules/rule-syntax/#metavariable-comparison) - **Metavariable-pattern**: demonstrates how to use [the `metavariable-pattern` key](/writing-rules/rule-syntax/#metavariable-pattern) - - **Dataflow analysis**: demonstrates how to leverage dataflow analysis through [`pattern-sources`](/writing-rules/data-flow/taint-mode/#sources), [`pattern-sinks`](/writing-rules/data-flow/taint-mode/#sinks), and [`pattern-sanitizers`](/writing-rules/data-flow/taint-mode/#sanitizers). - - **Dataflow analysis with taint labels**: demonstrates [how to define the sources you want to track and how data must flow](/writing-rules/data-flow/taint-mode/#taint-labels-pro-) + - **Dataflow analysis**: demonstrates how to leverage dataflow analysis through [`pattern-sources`](/writing-rules/data-flow/taint-mode/overview#sources), [`pattern-sinks`](/writing-rules/data-flow/taint-mode/overview#sinks), and [`pattern-sanitizers`](/writing-rules/data-flow/taint-mode/overview#sanitizers). + - **Dataflow analysis with taint labels**: demonstrates [how to define the sources you want to track and how data must flow](/writing-rules/data-flow/taint-mode/advanced#taint-labels-) - **HTTP validators**: Demonstrates how to write [Semgrep Secrets rules](/semgrep-secrets/rules/) that include [validators](/semgrep-secrets/validators/) 2. Modify the template, adding and changing the keys and values needed to finish your rule. 3. Optional: Click **Metadata** to update and enter additional metadata fields. 
diff --git a/docs/semgrep-code/semgrep-pro-engine-examples.md b/docs/semgrep-code/semgrep-pro-engine-examples.md index 3fb52a4745..cc3aa585f6 100644 --- a/docs/semgrep-code/semgrep-pro-engine-examples.md +++ b/docs/semgrep-code/semgrep-pro-engine-examples.md @@ -31,7 +31,7 @@ The following resources can help you test the code in the sections below. As you ## Taint tracking -Semgrep CE allows you to search for the flow of any potentially exploitable input into an important sink using taint mode. For more information, see the [taint mode](/writing-rules/data-flow/taint-mode) documentation. +Semgrep CE allows you to search for the flow of any potentially exploitable input into an important sink using taint mode. For more information, see the [taint mode](/writing-rules/data-flow/taint-mode/overview) documentation. In the examples below, see a comparison of Semgrep and Semgrep CE while searching for dangerous calls using data obtained `get_user_input` call. The rule does this by specifying the source of taint as `get_user_input(...)` and the sink as `dangerous(...);`. diff --git a/docs/writing-rules/data-flow/data-flow-overview.md b/docs/writing-rules/data-flow/data-flow-overview.md index 2d518956e8..2eedabad88 100644 --- a/docs/writing-rules/data-flow/data-flow-overview.md +++ b/docs/writing-rules/data-flow/data-flow-overview.md @@ -12,10 +12,9 @@ import DataFlowStatus from "/src/components/concept/_data-flow-status.mdx" # Dataflow analysis engine overview -Semgrep provides an intraprocedural dataflow analysis engine that opens various up Semgrep capabilities, including: - -- [Constant propagation](/writing-rules/data-flow/constant-propagation), which allows Semgrep to, for example, match `return 42` against `return x` when `x` can be reduced to `42` by constant folding. There is also an experimental feature of [Constant propagation](/writing-rules/data-flow/constant-propagation), called [Symbolic propagation](/writing-rules/experiments/symbolic-propagation). 
-- [Taint tracking (also known as taint analysis)](/writing-rules/data-flow/taint-mode/), which enables you to write simple rules that catch complex [injection bugs](https://owasp.org/www-community/Injection_Flaws), such as those that can result in [cross-site scripting (XSS)](https://owasp.org/www-community/attacks/xss/). +Semgrep provides an intraprocedural data-flow analysis engine that opens various Semgrep capabilities. Semgrep provides the following data-flow analyses: +- [Constant propagation](/writing-rules/data-flow/constant-propagation) allows Semgrep to, for example, match `return 42` against `return x` when `x` can be reduced to `42` by constant folding. There is also a specific experimental feature of [Constant propagation](/writing-rules/data-flow/constant-propagation), called [Symbolic propagation](/writing-rules/experiments/symbolic-propagation). +- [Taint tracking (known also as taint analysis)](/writing-rules/data-flow/taint-mode/overview) enables you to write simple rules that catch complex [injection bugs](https://owasp.org/www-community/Injection_Flaws), such as those that can result in [cross-site scripting (XSS)](https://owasp.org/www-community/attacks/xss/). All dataflow-related features are available for Semgrep's [supported languages](/supported-languages). Interfile (cross-file) analysis also supports dataflow analysis. For more details, see [ Perform cross-file analysis](/semgrep-code/semgrep-pro-engine-intro). diff --git a/docs/writing-rules/data-flow/taint-mode.md b/docs/writing-rules/data-flow/taint-mode.md deleted file mode 100644 index 86d16ef09a..0000000000 --- a/docs/writing-rules/data-flow/taint-mode.md +++ /dev/null @@ -1,749 +0,0 @@ ---- -slug: taint-mode -append_help_link: true -tags: - - Rule writing -description: >- - Taint mode allows you to write simple rules that catch complex injection bugs thanks to taint analysis. 
---- - -# Taint analysis - -Semgrep supports [taint analysis](https://en.wikipedia.org/wiki/Taint_checking), or taint tracking, through taint rules, which are specified by adding `mode: taint` to your rule. - -Taint analysis is a dataflow analysis that tracks the flow of untrusted, or **tainted** data throughout the body of a function or method. Tainted data originates from tainted **sources**. If tainted data is not transformed or checked accordingly (sanitized), taint analysis reports a finding whenever tainted data reaches a vulnerable function, called a **sink**. Tainted data flow from sources to sinks through **propagators**, such as assignments or function calls. - -The following video provides a quick overview of taint mode: - - -## Getting started - -Taint tracking rules must specify `mode: taint`, which enables the following operators: - -- `pattern-sources` (required) -- `pattern-sinks` (required) -- `pattern-propagators` (optional) -- `pattern-sanitizers` (optional) - -These operators (which act as `pattern-either` operators) take a list of patterns that specify what is considered a source, a sink, a propagator, or a sanitizer. Note that you can use **any** pattern operator, and you have the same expressive power as a `mode: search` rule. - -For example: - - - -Here, Semgrep tracks the data returned by `get_user_input()`, which is the source of taint. Think of Semgrep running the pattern `get_user_input(...)` on your code, finding all places where `get_user_input` gets called, and labeling them as tainted. That is exactly what is happening under the hood! - -The rule specifies the sanitizer `sanitize_input(...)`, so any expression that matches that pattern is considered sanitized. In particular, the expression `sanitize_input(data)` is labeled as sanitized. Even if `data` is tainted, it does not produce any findings as it occurs inside a piece of sanitized code. 
- -Finally, the rule specifies that anything matching either `html_output(...)` or `eval(...)` should be regarded as a sink. There are two calls in `html_output(data)` that are both labeled as sinks. The first one in `route1` is not reported because `data` is sanitized before reaching the sink, whereas the second one in `route2` is reported because the `data` that reaches the sink is still tainted. - -You can find more examples of taint rules in the [Semgrep Registry](https://semgrep.dev/r?owasp=injection%2Cxss), such as [express-sandbox-code-injection](https://semgrep.dev/editor?registry=javascript.express.security.express-sandbox-injection.express-sandbox-code-injection). - -:::info -[Metavariables](/writing-rules/pattern-syntax#metavariables) used in `pattern-sources` are considered to be _different_ from those used in `pattern-sinks`, even if the metavariables have the same name! See [Metavariables, rule message, and unification](#metavariables-rule-message-and-unification) for further details. -::: - -## Sources - -A taint source is specified by a pattern. Like a search-mode rule, you can start this pattern with one of the following keys: `pattern`, `patterns`, `pattern-either`, `pattern-regex`. Note that **any** subexpression that is matched by this pattern is regarded as a source of taint. - -In addition, taint sources accept the following options: - -| Option | Type | Default | Description | -| :-----------------|:------------------------- | :------ | :--------------------------------------------------------------------- | -| `exact` | {`false`, `true`} | `false` | See [_Exact sources_](#exact-sources). | -| `by-side-effect` | {`false`, `true`, `only`} | `false` | See [_Sources by side-effect_](#sources-by-side-effect). | -| `control` (Pro) 🧪 | {`false`, `true`} | `false` | See [_Control sources_](#control-sources-pro-). | - -Example: - -```yaml -pattern-sources: -- pattern: source(...) 
-``` - -### Exact sources - -Given the source specification below and a piece of code, such as `source(sink(x))`, the call `sink(x)` is reported as a tainted sink. - -```yaml -pattern-sources: -- pattern: source(...) -``` - -The reason is that the pattern `source(...)` matches all of `source(sink(x))`, and that makes Semgrep consider every subexpression in that piece of code as being a source. In particular, `x` is a source, and it is being passed into `sink`! - - - -It is possible to instruct Semgrep to only consider as taint sources the "exact" matches of a source pattern by setting `exact: true`: - -```yaml -pattern-sources: -- pattern: source(...) - exact: true -``` - -Once the source is "exact," Semgrep no longer considers subexpressions as taint sources, and `sink(x)` inside `source(sink(x))` is not reported as a tainted sink (unless `x` is tainted in some other way). - - - -For many rules, this distinction is not very meaningful, because it does not always make sense that a sink occurs inside the arguments of a source function. - -:::note -If one of your rules relies on non-exact matching of sources, make it explicit with `exact: false`, even if it is the current default, so that your rule does not break if the default changes. 
-::: - -### Sources by side-effect - -Consider the following Python code, where `make_tainted` is a function that makes its argument tainted by side-effect: - -```python -make_tainted(my_set) -sink(my_set) -``` - -This kind of source can be specified by setting `by-side-effect: true`: - -```yaml -pattern-sources: - - patterns: - - pattern: make_tainted($X) - - focus-metavariable: $X - by-side-effect: true -``` - -When this option is enabled, and the source specification matches a variable (or in general, an [l-value](https://en.wikipedia.org/wiki/Value_(computer_science)#lrvalue)) exactly, then Semgrep assumes that the variable (or l-value) becomes tainted by side-effect at the precise places where the source specification produces a match. - - - -The matched occurrences themselves are considered tainted; that is, the occurrence of `x` in `make_tainted(x)` is itself tainted too. If you do not want this to be the case, then set `by-side-effect: only` instead. - -:::note -You must use `focus-metavariable: $X` to focus the match on the l-value that you want to taint; otherwise, `by-side-effect` does not work. -::: - -If the source does not set `by-side-effect`, then only the very occurrence of `x` in `make_tainted(x)` is tainted, but not the occurrence of `x` in `sink(x)`. The source specification matches only the first occurrence, and without `by-side-effect: true`, Semgrep does not know that `make_tainted` is updating the variable `x` by side-effect. Thus, a taint rule using such a specification does not produce any finding. - -:::info -You could be tempted to write a source specification as the following example (and this was the official workaround before `by-side-effect`): - -```yaml -pattern-sources: -- patterns: - - pattern-inside: | - make_tainted($X) - ... - - pattern: $X -``` - -This tells Semgrep that **every** occurrence of `$X` after `make_tainted($X)` must be considered a source. - -This approach has two main limitations. 
First, it overrides any sanitization that can be performed on the code matched by `$X`. In the example code below, the call `sink(x)` is reported as tainted despite `x` having been sanitized! - -```python -make_tainted(x) -x = sanitize(x) -sink(x) # false positive -``` - -Note also that [`...` ellipses operator](/writing-rules/pattern-syntax/#ellipses-and-statement-blocks) has limitations. For example, in the code below, Semgrep does not match any finding if such a source specification is in use: - -```python -if cond: - make_tainted(x) -sink(x) # false negative -``` - -The `by-side-effect` option was added precisely [to address those limitations](https://semgrep.dev/playground/s/JDv4y). However, that kind of workaround can still be useful in other situations! -::: - -### Function arguments as sources - -To specify that an argument of a function must be considered a taint source, simply write a pattern that matches that argument: - -```yaml -pattern-sources: - - patterns: - - pattern-inside: | - def foo($X, ...): - ... - - focus-metavariable: $X -``` - -Note that the use of `focus-metavariable: $X` is very important, and using `pattern: $X` is **not** equivalent. With `focus-metavariable: $X`, Semgrep matches the formal parameter exactly. Click "Open in Playground" below and use "Inspect Rule" to visualize what the source is matching. - - - -The following example does the same with this other taint rule that uses `pattern: $X`. The `pattern: $X` does not match the formal parameter itself, but matches all its uses inside the function definition. Even if `x` is sanitized via `x = sanitize(x)`, the occurrence of `x` inside `sink(x)` is a taint source itself (due to `pattern: $X`) and so `sink(x)` is tainted! - - - -### Control sources (Pro) 🧪 - -**Control taint sources is a Semgrep Pro feature.** - -Typically, taint analysis tracks the flow of tainted _data_, but taint sources can also track the flow of tainted _control_ by setting `control: true`. 
- -```yaml -pattern-sources: -- pattern: source(...) - control: true -``` - -This is useful for checking _reachability_, that is, to check if from a given code location the control-flow can reach another code location, regardless of whether there is any flow of data between them. In the following example, there is a check for whether `foo()` could be followed by `bar()`: - - - -By using a control source, you can define a context from which Semgrep detects if a call to some other code, such as a sink, can be reached. - -:::note -Use [taint labels](#taint-labels-pro-) to combine both data and control sources in the same rule. -::: - -## Sanitizers - -A taint sanitizer is specified by a pattern. Like in a search-mode rule, you can start this pattern with one of the following keys: `pattern`, `patterns`, `pattern-either`, `pattern-regex`. Note that **any** subexpression that is matched by this pattern is regarded as sanitized. - -In addition, taint sanitizers accept the following options: - -| Option | Type | Default | Description | -| :-----------------|:------------------------- | :------ | :--------------------------------------------------------------------- | -| `exact` | {`false`, `true`} | `false` | See [_Exact sanitizers_](#exact-sanitizers). | -| `by-side-effect` | {`false`, `true`, `only`} | `false` | See [_Sanitizers by side-effect_](#sanitizers-by-side-effect). | - -Example: - -```yaml -pattern-sanitizers: -- pattern: sanitize(...) -``` - -### Exact sanitizers - -Given the sanitizer specification below, and a piece of code such as `sanitize(sink("taint"))`, the call `sink("taint")` is **not** reported. - -```yaml -pattern-sanitizers: -- pattern: sanitize(...) -``` - -The reason is that the pattern `sanitize(...)` matches all of `sanitize(sink("taint"))`, and that makes Semgrep consider every subexpression in that piece of code as being sanitized. In particular, `"taint"` is considered to be sanitized! 
- - - -This is the default for historical reasons, but it may change in the future. - -It is possible to instruct Semgrep to only consider as sanitized the "exact" matches of a sanitizer pattern by setting `exact: true`: - -```yaml -pattern-sanitizers: -- pattern: sanitize(...) - exact: true -``` - -Once the sanitizer is "exact," Semgrep no longer considers subexpressions as sanitized, and `sink("taint")` inside `sanitize(sink("taint"))` is reported as a tainted sink. - - - -For many rules, this distinction is not very meaningful because it does not always make sense that a sink occurs inside the arguments of a sanitizer function. - -:::note -If one of your rules relies on non-exact matching of sanitizers, make it explicit with `exact: false`, even if it is the current default, so that your rule does not break if the default changes. -::: - -### Sanitizers by side-effect - -Consider the following hypothetical Python code, where it is guaranteed that after `check_if_safe(x)`, the value of `x` must be a safe one. - -```python -x = source() -check_if_safe(x) -sink(x) -``` - -This kind of sanitizer can be specified by setting `by-side-effect: true`: - -```yaml -pattern-sanitizers: - - patterns: - - pattern: check_if_safe($X) - - focus-metavariable: $X - by-side-effect: true -``` -When this option is enabled, and the sanitizer specification matches a variable (or in general, an l-value) exactly, then Semgrep assumes that the variable (or l-value) is sanitized by side-effect at the precise places where the sanitizer specification produces a match. - - - -:::note -It is important to use `focus-metavariable: $X` to focus the match on the l-value that is to be sanitized; otherwise, `by-side-effect` does not work as expected. -::: - -If the sanitizer does not set `by-side-effect`, then only the very occurrence of `x` in `check_if_safe(x)` is sanitized, but not the occurrence of `x` in `sink(x)`. 
The sanitizer specification matches only the first occurrence, and without `by-side-effect: true`, Semgrep does not know that `check_if_safe` is updating/sanitizing the variable `x` by side-effect. Thus, a taint rule using such a specification does produce a finding for `sink(x)` in the preceding example. - -:::info -You might be tempted to write a sanitizer specification like the one below (and this was the official workaround before `by-side-effect`): - -```yaml -pattern-sanitizers: -- patterns: - - pattern-inside: | - check_if_safe($X) - ... - - pattern: $X -``` - -This tells Semgrep that **every** occurrence of `$X` after `check_if_safe($X)` must be considered sanitized. - -This approach has two main limitations. First, it overrides any further tainting that can be performed on the code matched by `$X`. In the example code below, the call `sink(x)` is **not** reported as tainted despite `x` having been tainted! - -```python -check_if_safe(x) -x = source() -sink(x) # false negative -``` - -Note also that the [`...` ellipses operator](/writing-rules/pattern-syntax/#ellipses-and-statement-blocks) has limitations. For example, in the code below, Semgrep still matches despite `x` having been sanitized in both branches: - -```python -if cond: - check_if_safe(x) -else: - check_if_safe(x) -sink(x) # false positive -``` - -The `by-side-effect` option was added precisely [to address those limitations](https://semgrep.dev/playground/s/PeB3W). However, that kind of workaround can still be useful in other situations! -::: - -## Sinks - -A taint sink is specified by a pattern. Like in a search-mode rule, you can start this pattern with one of the following keys: `pattern`, `patterns`, `pattern-either`, `pattern-regex`. Unlike sources and sanitizers, by default, Semgrep does not consider the subexpressions of the matched expressions as sinks. 
- -In addition, taint sinks accept the following options: - -| Option | Type | Default | Description | -| :---------| :-----------------| :------ | :--------------------------------------------------------------------- | -| `exact` | {`false`, `true`} | `true` | See [_Non-exact sinks_](#non-exact-sinks). | -| `at-exit` (Pro) 🧪 | {`false`, `true`} | `false` | See [_At-exit sinks_](#at-exit-sinks-pro-). | - -Example: - -```yaml -pattern-sinks: -- pattern: sink(...) -``` - -### Non-exact sinks - -Given the sink specification below, a piece of code such as `sink("foo" if tainted else "bar")` is **not** reported as a tainted sink. - -```yaml -pattern-sinks: -- pattern: sink(...) -``` - -This is because Semgrep considers that the sink is the argument of the `sink` function, and the actual argument being passed is `"foo" if tainted else "bar"`, which evaluates to either `"foo"` or `"bar"`, and neither of them is tainted. - - - -It is possible to instruct Semgrep to consider as a taint sink any of the subexpressions matching the sink pattern, by setting `exact: false`: - -```yaml -pattern-sinks: -- pattern: sink(...) - exact: false -``` - -Once the sink is "non-exact," Semgrep considers subexpressions as taint sinks, and `tainted` inside `sink("foo" if tainted else "bar")` is then reported as a tainted sink. - - - -### Function arguments as sinks - -You can specify that only one (or a subset) of the arguments of a function is the actual sink by using `focus-metavariable`: - -```yaml -pattern-sinks: - - patterns: - - pattern: sink($SINK, ...) - - focus-metavariable: $SINK -``` - -This rule causes Semgrep to only annotate the first parameter passed to `sink` as the sink, rather than the function `sink` itself. If taint goes into any other parameter of `sink`, then that is not considered a problem. 
- - - -Anything that you can match with Semgrep can be made into a sink, like the index in an array access: - -```yaml -pattern-sinks: - - patterns: - - pattern-inside: $ARRAY[$SINK] - - focus-metavariable: $SINK -``` - -:::note -If you specify a sink such as `sink(...)`, then any tainted data passed to `sink`, through any of its arguments, results in a finding. - - -::: - -### At-exit sinks (Pro) 🧪 - -**At-exit taint sinks is a Semgrep Pro feature.** - -At-exit sinks are meant to facilitate writing leak-detection rules using taint mode. By setting `at-exit: true`, you can restrict a sink specification to only match at "exit" statements, that is, statements after which the control-flow exits the function being analyzed. - -```yaml -pattern-sinks: -- pattern-either: - - pattern: return ... - - pattern: $F(...) - at-exit: true -``` - -The preceding sink pattern matches either `return` statements (which are always "exit" statements), or function calls occurring as "exit" statements. - -Unlike regular sinks, at-exit sinks trigger a finding if any tainted l-value reaches the location of the sink. For example, the preceding at-exit sink specification triggers a finding at a `return 0` statement if some tainted l-value reaches the `return`, even if `return 0` itself is not tainted. The location itself is the sink rather than the code that is at that location. - -You can use this, for example, to check that file descriptors are being closed within the same function where they were opened. - - - -The `print(content)` statement is reported because the control flow exits the function at that point, and the file has not been closed. - -## Propagators (Pro) - -**Custom taint propagators is a Semgrep Pro feature.** - -By default, tainted data automatically propagates through assignments, operators, and function calls (from inputs to outputs). 
However, there are other ways in which taint can propagate, which can require language or library-specific knowledge that Semgrep does not have built in. - -A taint propagator requires a pattern to be specified. Like in a search-mode rule, you can start this pattern with one of the following keys: `pattern`, `patterns`, `pattern-either`, `pattern-regex`. - -A propagator also needs to specify the origin (`from`) and the destination (`to`) of the taint to be propagated. - -| Field | Type | Description | -| :----------|:------------------------- | :--------------------------------------------------------------------- | -| `from` | metavariable | Source of propagation. | -| `to` | metavariable | Destination of propagation. | - -In addition, taint propagators accept the following options: - -| Option | Type | Default | Description | -| :-----------------|:------------------------- | :------ | :--------------------------------------------------------------------- | -| `by-side-effect` | {`false`, `true`} | `true` | See [_Propagation without side-effect_](#propagation-without-side-effect). | - -For example, given the following propagator, if taint goes into the second argument of `strcpy`, its first argument gets the same taint: - -```yaml -pattern-propagators: -- pattern: strcpy($DST, $SRC) - from: $SRC - to: $DST -``` - -:::info -Taint propagators only work intra-procedurally, that is, within a function or method. You cannot use taint propagators to propagate taint across different functions/methods. Use [inter-procedural analysis](#inter-procedural-analysis-pro). -::: - -### Understanding custom propagators - -Consider the following Python code where an unsafe `user_input` is stored in a `set` data structure. A random element from `set` is then passed into a `sink` function. This random element can be `user_input` itself, leading to an injection vulnerability! 
- -```python -def test(s): - x = user_input - s = set([]) - s.add(x) - #ruleid: test - sink(s.pop()) -``` - -The following rule cannot find the above-described issue. The reason is that Semgrep is not aware that executing `s.add(x)` makes `x` one of the elements in the set data structure `s`. - -```yaml -mode: taint -pattern-sources: -- pattern: user_input -pattern-sinks: -- pattern: sink(...) -``` - -The use of **taint propagators** enables Semgrep to propagate taint in this and other scenarios. -Taint propagators are specified under the `pattern-propagators` key: - -```yaml -pattern-propagators: -- pattern: $S.add($E) - from: $E - to: $S -``` - -In the preceding example, Semgrep finds the pattern `$S.add($E)`, and it checks whether the code matched by `$E` is tainted. If it is tainted, Semgrep propagates that same taint to the code matched by `$S`. Thus, adding tainted data to a set marks the set itself as tainted. - - - -Note that `s` becomes tainted _by side-effect_ after `s.add(x)`; this is due to `by-side-effect: true` being the default for propagators, and because `s` is an l-value. - -In general, a taint propagator must specify: -1. A pattern containing **two** metavariables. These two metavariables specify where taint is propagated **from** and **to**. -2. The `to` and `from` metavariables. These metavariables should match an **expression**. - - The `from` metavariable specifies the entry point of the taint. - - The `to` metavariable specifies where the tainted data is propagated to, typically an object or data structure. If option `by-side-effect` is enabled (as it is by default) and the `to` metavariable matches an l-value, the propagation is side-effectful. - -In the preceding example, pattern `$S.add($E)` includes two metavariables `$S` and `$E`. Given `from: $E` and `to: $S`, and with `$E` matching `x` and `$S` matching `s`, when `x` is tainted, then `s` becomes tainted (by side-effect) with the same taint as `x`. 
- -Another situation where taint propagators can be useful is to specify in Java that, when iterating a collection that is tainted, the individual elements must also be considered tainted: - -```yaml -pattern-propagators: -- pattern: $C.forEach(($X) -> ...) - from: $C - to: $X -``` - -### Propagation without side-effect - -Taint propagators can be used in very imaginative ways, and in some cases, you may not want taint to propagate by side-effect. This can be achieved by disabling `by-side-effect`, which is enabled by default. - -For example: - -```yaml -pattern-propagators: - - patterns: - - pattern: | - if something($FROM): - ... - $TO() - ... - from: $FROM - to: $TO - by-side-effect: false -``` - -The preceding propagator specifies that, inside an `if` block, where the condition is `something($FROM)`, you can propagate taint from `$FROM` to any function that is being called without arguments, `$TO()`. - - - -Because the rule disables `by-side-effect`, the `sink` occurrence that is inside the `if` block is tainted, but this does not affect the `sink` occurrence outside the `if` block. - -## Findings - -Taint findings are accompanied by a taint trace that explains how the taint flows from source to sink. - - - -### Deduplication of findings - -Semgrep tracks all the possible ways that taint can reach a sink, but at present, it only reports one taint trace among the possible ones. Click "Open in Playground" in the example below, run the example to get one finding, and then ask the Playground to visualize the dataflow of the finding. Even though `sink` can be tainted via `x` or via `y`, the trace only shows you one of these possibilities. If you replace `x = user_input` with `x = "safe"`, then Semgrep reports the taint trace through `y`. - - - -### Report findings on the sources (Pro) - -**Reporting findings on the source of taint is a Semgrep Pro feature.** - -By default, Semgrep reports taint findings at the location of the sink being matched. 
You must look at the taint trace to identify where the taint is coming from. It is also possible to make Semgrep report the findings at the location of the taint sources by setting the [rule-level option](/writing-rules/rule-syntax/#options) `taint_focus_on` to `source`: - -```yaml -options: - taint_focus_on: source -``` - - - -The [deduplication of findings](#deduplication-of-findings) still applies in this case. While Semgrep now reports all the taint sources, if a taint source can reach multiple sinks, the taint trace only informs you about one of them. - -## Minimizing false positives - -The following [rule options](/writing-rules/rule-syntax/#options) can be used to minimize false positives: - -| Rule option | Default | Description | -| :-----------------------------| :------ | :--------------------------------------------------------------------- | -| `taint_assume_safe_booleans` | `false` | Boolean data is never considered tainted (works better with type annotations). | -| `taint_assume_safe_numbers` | `false` | Numbers (integers, floats) are never considered tainted (works better with type annotations). | -| `taint_assume_safe_indexes` | `false` | A tainted index expression `I` does not make an access expression `E[I]` tainted (it is only tainted if `E` is tainted). | -| `taint_assume_safe_functions` | `false` | A function call like `F(E)` is not considered tainted even if `E` is tainted. (When using Pro's [inter-procedural taint analysis](#inter-procedural-analysis-pro), this only applies to functions for which Semgrep cannot find a definition.) | -| `taint_only_propagate_through_assignments` 🧪 | `false` | Disables all implicit taint propagation except for assignments. | - -### Restrict taint by type (Pro) - -When `taint_assume_safe_booleans` is enabled, Semgrep automatically sanitizes Boolean expressions when it can infer that the expression resolves to a Boolean. 
- -For example, a tainted string compared against a constant string is not considered a tainted expression: - - - -Similarly, enabling `taint_assume_safe_numbers` Semgrep automatically sanitizes numeric expressions when it can infer that the expression is numeric. - - - -You could define explicit sanitizers that clean the taint from Boolean or numeric expressions, but these options are more convenient and also more efficient. - -:::note -Semgrep Pro's ability to infer types for expressions varies depending on the language. For example, in Python, type annotations are not always present, and the `+` operator can also be used to concatenate strings. Semgrep also ignores the types of functions and classes coming from third-party libraries. - - -::: - -### Assume tainted indexes are safe - -By default, Semgrep assumes that accessing an array-like object with a tainted index (that is, `obj[tainted]`) is itself a tainted **expression**, even if the **object** itself is not tainted. Setting `taint_assume_safe_indexes: true` makes Semgrep assume that these expressions are safe. - - - -### Assume function calls are safe - -:::note -A function call is _opaque_ when Semgrep does not have access to its definition, to examine it and determine its "taint behavior" (for example, whether the function call propagates or not any taint that comes through its inputs). In Semgrep Community Edition (CE), where taint analysis is intraprocedural, all function calls are opaque. In Semgrep Pro, with [inter-procedural taint analysis](#inter-procedural-analysis-pro), an opaque function could be one coming from a third-party library. -::: - -By default, Semgrep considers that an _opaque_ function call propagates any taint passed through any of its arguments to its output. - -For example, in the code below, `some_safe_function` receives tainted data as input, so Semgrep assumes that it also returns tainted data as output. As a result, a finding is produced. 
- -```javascript -var x = some_safe_function(tainted); -sink(x); // undesired finding here -``` - -This can generate false positives, and for certain rules on certain codebases it can produce a high amount of noise. - -Setting `taint_assume_safe_functions: true` makes Semgrep assume that opaque function calls are safe and do not propagate any taint. If it is desired that specific functions propagate taint, then that can be achieved via custom propagators: - - - -### Propagate only through assignments 🧪 - -Setting `taint_only_propagate_through_assignments: true` makes Semgrep propagate taint only through trivial assignments of the form ` = `. It requires the user to be explicit about any other kind of taint propagation that is to be performed. - -For example, neither `unsafe_function(tainted)` nor `tainted_string + "foo"` are considered tainted expressions: - - - -## Metavariables, rule message, and unification - -The patterns specified by `pattern-sources` and `pattern-sinks` (and `pattern-sanitizers`) are all independent of each other. If a metavariable used in `pattern-sources` has the same name as a metavariable used in `pattern-sinks`, these are still different metavariables. - -In the message of a taint-mode rule, you can refer to any metavariable bound by `pattern-sinks`, as well as to any metavariable bound by `pattern-sources` that does not conflict with a metavariable bound by `pattern-sinks`. - -Semgrep can also treat metavariables with the same name as the _same_ metavariable, simply set `taint_unify_mvars: true` using rule `options`. Unification enforces that whatever a metavariable binds to in each of these operators is, syntactically speaking, the **same** piece of code. For example, if a metavariable binds to a code variable `x` in the source match, it must bind to the same code variable `x` in the sink match. In general, unless you know what you are doing, avoid metavariable unification between sources and sinks. 
- -The following example demonstrates the use of source and sink metavariable unification: - - - -## Inter-procedural analysis (Pro) - -**Inter-procedural taint analysis is a Semgrep Pro feature.** - -[Semgrep](/semgrep-pro-vs-oss/) can perform inter-procedural taint analysis, that is, to track taint across multiple functions. - -In the example below, `user_input` is passed to `foo` as input and, from there, flows to the sink at line 3, through a call chain involving three functions. Semgrep is able to track this and report the sink as tainted. Semgrep also provides an inter-procedural taint trace that explains how exactly `user_input` reaches the `sink(z)` statement (click "Open in Playground" then click "dataflow" in the "Matches" panel). - - - -Using the CLI option `--pro-intrafile`, Semgrep performs inter-procedural (across functions) _intra_-file (within one file) analysis. That is, it tracks taint across functions, but not cross file boundaries. This is supported for essentially every language, and performance is very close to that of intraprocedural taint analysis. - -Using the CLI option `--pro`, Semgrep performs inter-procedural (across functions) as well as *inter*-file (across files) analysis. Inter-file analysis is only supported for [a subset of languages](/supported-languages#language-maturity-summary). For a rule to run interfile, it also needs to set `interfile: true`: - -```yaml -options: - interfile: true -``` - -**Memory requirements for inter-file analysis:** -While interfile analysis is more powerful, it also demands more memory resources. The Semgrep team advises a minimum of 4 GB of memory per core, but **recommends 8 GB per core or more**. The amount of memory needed depends on the codebase and on the number of interfile rules being run. - -## Taint mode sensitivity - -### Field sensitivity - -The taint engine provides basic field sensitivity support. It can: - -- Track that `x.a.b` is tainted, but `x` or `x.a` is **not** tainted. 
If `x.a.b` is tainted, any extension of `x.a.b` (such as `x.a.b.c`) is considered tainted by default. -- Track that `x.a` is tainted, but remember that `x.a.b` has been sanitized. Thus, the engine records that `x.a.b` is **not** tainted, but `x.a` or `x.a.c` are still tainted. - -:::note -The taint engine does track taint **per variable** and not **per object in memory**. The taint engine does not track aliasing at present. -::: - - - -### Index sensitivity (Pro) - -**Index sensitivity is a Semgrep Pro feature.** - -Semgrep Pro has basic index sensitivity support: -- Only for accesses using the built-in `a[E]` syntax. -- Works for _statically constant_ indexes that may be either integers (for example, `a[42]`) or strings (for example, `a["foo"]`). -- If an arbitrary index `a[i]` is sanitized, then every index becomes clean of taint. - - - -## Taint labels (Pro) 🧪 - -Taint labels increase the expressiveness of taint analysis by allowing you to specify and track different kinds of tainted data in one rule using labels. This capability has various uses, for example, when data becomes dangerous in several steps that are hard to specify through a single pair of source and sink. - - - -To include taint labels in a taint mode rule, follow these steps: - -1. Attach a `label` key to the taint source. For example, `label: TAINTED` or `label: INPUT`. See the example below: - ```yaml - pattern-sources: - - pattern: user_input - label: INPUT - ``` - Semgrep accepts any valid Python identifier as a label. - -2. Restrict a taint source to a subset of labels using the `requires` key. Extending the previous example, see the `requires: INPUT` below: - ```yaml - pattern-sources: - - pattern: user_input - label: INPUT - - pattern: evil(...) - requires: INPUT - label: EVIL - ``` - Combine labels using the `requires` key. To combine labels, use Python Boolean operators. For example: `requires: LABEL1 and not LABEL2`. - -3. 
Use the `requires` key to restrict a taint sink in the same way as source: - ```yaml - pattern-sinks: - - pattern: sink(...) - requires: EVIL - ``` - -:::info -- Semgrep accepts valid Python identifiers as labels. -- Restrict a source to a subset of labels using the `requires` key. You can combine more labels in the `requires` key using Python Boolean operators. For example: `requires: LABEL1 and not LABEL2`. -- Restrict a sink also. The extra taint is only produced if the source itself is tainted and satisfies the `requires` formula. -::: - -In the example below, assume that `user_input` is dangerous but only when it passes through the `evil` function. This can be specified with taint labels as follows: - - - - - -### Multiple `requires` expressions in taint labels - -You can assign an independent `requires` expression to each metavariable matched by a sink. Given `$OBJ.foo($ARG)`, you can easily require that `$OBJ` has some label `XYZ` and `$ARG` has some label TAINTED, and at the same time `focus-metavariable: $ARG`: - -``` -pattern-sinks: - - patterns: - - pattern: $OBJ.foo($SINK, $ARG) - - focus-metavariable: $SINK - requires: - - $SINK: BAD - - $OBJ: AAA - - $ARG: BBB -``` diff --git a/docs/writing-rules/data-flow/taint-mode/advanced.md b/docs/writing-rules/data-flow/taint-mode/advanced.md new file mode 100644 index 0000000000..d56feef2f2 --- /dev/null +++ b/docs/writing-rules/data-flow/taint-mode/advanced.md @@ -0,0 +1,525 @@ +--- +slug: advanced +title: Advanced techniques for taint analysis +hide_title: true +description: Learn advanced techniques for taint mode, which allows you to write rules to catch complex injection bugs. +tags: + - Rule writing + - Dataflow analysis + - Taint analysis +--- + +# Advanced taint analysis techniques + +This page covers advanced taint analysis techniques for use when writing rules to catch complex injection bugs. 
If you are new to writing taint mode rules, begin with [Overview](/writing-rules/data-flow/taint-mode/overview). + +## Taint by side effect + +### Taint sources by side effect + +Consider the following Python code, where `make_tainted` is a function that makes its argument tainted by side effect: + +```python +make_tainted(my_set) +sink(my_set) +``` + +This kind of source can be specified by setting `by-side-effect: true`: + +```yaml +pattern-sources: + - patterns: + - pattern: make_tainted($X) + - focus-metavariable: $X + by-side-effect: true +``` + +When `by-side-effect: true` is enabled and the source specification matches a variable, or more generally, an [l-value](https://en.wikipedia.org/wiki/Value_(computer_science)#lrvalue) exactly, then Semgrep assumes that the variable, or l-value, becomes tainted by side effect at the places where the source specification produces a match. + + + +The matched occurrences themselves are considered tainted; that is, the occurrence of `x` in `make_tainted(x)` is itself tainted too. If you do not want this to be the case, then set `by-side-effect: only` instead. + +:::note +You must use `focus-metavariable: $X` to focus the match on the l-value that you want to taint; otherwise, `by-side-effect` does not work. +::: + +If the source doesn't set `by-side-effect`, then only the very occurrence of `x` in `make_tainted(x)` will be tainted, not the occurrence of `x` in `sink(x)`. The source specification matches only the first occurrence, and without `by-side-effect: true`, Semgrep does not recognize that `make_tainted` updates the variable `x` by side effect. Thus, a taint rule using such a specification does not produce any finding. + +
+Original implementation for tainting variables by side effect + +Before the implementation of `by-side-effect`, the following example was the official workaround to obtain similar behavior: + +```yaml +pattern-sources: +- patterns: + - pattern-inside: | + make_tainted($X) + ... + - pattern: $X +``` + +This definition says that **every** occurrence of `$X` after `make_tainted($X)` must be considered a source. However, this approach has two main limitations: + +1. It overrides any sanitization that can be performed on the code matched by `$X`. In the example code below, the call `sink(x)` is reported as tainted despite `x` having been sanitized! + + ```python + make_tainted(x) + x = sanitize(x) + sink(x) # false positive + ``` + +2. The [`...` ellipses operator](/writing-rules/pattern-syntax/#ellipses-and-statement-blocks) has limitations. For example, in the code below, Semgrep does not match any finding if such a source specification is in use: + + ```python + if cond: + make_tainted(x) + sink(x) # false negative + ``` +
 + +### Taint sanitizers by side-effect + +Consider the following Python code, where it is guaranteed that, after `check_if_safe(x)`, the value of `x` must be a safe one. + +```python +x = source() +check_if_safe(x) +sink(x) +``` + +This kind of sanitizer can be specified by setting `by-side-effect: true`: + +```yaml +pattern-sanitizers: + - patterns: + - pattern: check_if_safe($X) + - focus-metavariable: $X + by-side-effect: true +``` + +If you enable `by-side-effect` and the sanitizer specification matches a variable, or more generally, an l-value, exactly, Semgrep assumes that the variable or l-value is sanitized by side effect at the places where the sanitizer specification produces a match. + + + +If the sanitizer doesn't set `by-side-effect`, then only the very occurrence of `x` in `check_if_safe(x)` is sanitized and *not* the occurrence of `x` in `sink(x)`. The sanitizer specification matches only the first occurrence, and without `by-side-effect: true`, Semgrep doesn't know that `check_if_safe` updates and sanitizes the variable `x` by side effect. Thus, a taint rule using such a specification still produces a finding for `sink(x)` in the preceding example. + +:::note +Ensure that you use `focus-metavariable: $X` to focus the match on the l-value that you want to sanitize. Otherwise, `by-side-effect` does not work as expected. +::: + +
+Original implementation for tainting sanitizers by side effect + +Before the implementation of `by-side-effect`, the following example was the official workaround to obtain similar behavior: + +```yaml +pattern-sanitizers: +- patterns: + - pattern-inside: | + check_if_safe($X) + ... + - pattern: $X +``` + +This specification tells Semgrep that **every** occurrence of `$X` after `check_if_safe($X)` must be considered sanitized. + +This approach has two main limitations: + +1. It overrides any further tainting that can be performed on the code matched by `$X`. In the following example, the call `sink(x)` is **not** reported as tainted despite `x` having been tainted: + ```python + check_if_safe(x) + x = source() + sink(x) # false negative + ``` +2. The [`...` ellipses operator](/writing-rules/pattern-syntax/#ellipses-and-statement-blocks) has limitations. For example, in the following code, Semgrep still returns matches despite `x` having been sanitized in both branches: + ```python + if cond: + check_if_safe(x) + else: + check_if_safe(x) + sink(x) # false positive + ``` + +
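As a rough sketch, the side-effect source and sanitizer specifications described above can be combined in a single rule. The rule `id`, `message`, `severity`, and the `sink(...)` pattern are illustrative assumptions; `make_tainted` and `check_if_safe` are the hypothetical functions used in the preceding examples.

```yaml
rules:
  - id: side-effect-source-and-sanitizer  # hypothetical rule id
    mode: taint
    languages: [python]
    severity: WARNING
    message: Data tainted by make_tainted reaches sink without a check_if_safe call
    pattern-sources:
      - patterns:
          - pattern: make_tainted($X)
          - focus-metavariable: $X
        # The l-value matched by $X becomes tainted after the call.
        by-side-effect: true
    pattern-sanitizers:
      - patterns:
          - pattern: check_if_safe($X)
          - focus-metavariable: $X
        # The l-value matched by $X is considered clean after the call.
        by-side-effect: true
    pattern-sinks:
      - pattern: sink(...)
```

Under these assumptions, a call to `sink(x)` that follows `make_tainted(x)` would be expected to produce a finding unless `check_if_safe(x)` runs in between.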
 + +## Taint function arguments + +### Taint function arguments as sources + +To specify that an argument of a function must be considered a taint source, you can write a pattern that matches the argument: + +```yaml +pattern-sources: + - patterns: + - pattern-inside: | + def foo($X, ...): + ... + - focus-metavariable: $X +``` + +Note that the use of `focus-metavariable: $X` is essential, and using `pattern: $X` is **not** equivalent. With `focus-metavariable: $X`, Semgrep matches the formal parameter exactly. Click "Open in Playground" below and use "Inspect Rule" to visualize what the source is matching. + + + +The subsequent example defines the same behavior with a taint rule that uses `pattern: $X`. The `pattern: $X` does not match the formal parameter itself, but matches all its uses inside the function definition. Even if `x` is sanitized via `x = sanitize(x)`, the occurrence of `x` inside `sink(x)` is a taint source itself (due to `pattern: $X`) and so `sink(x)` is tainted. + + + +### Taint function arguments as sinks + +You can specify that only one, or a subset, of the arguments of a function is the actual sink by using `focus-metavariable`: + +```yaml +pattern-sinks: + - patterns: + - pattern: sink($SINK, ...) + - focus-metavariable: $SINK +``` + + +This rule causes Semgrep to annotate only the first argument passed to `sink` as the sink, rather than the function `sink` itself. If taint goes into any other argument of `sink`, then that is not considered a problem. + + + +Anything that you can match with Semgrep can be made into a sink, such as the index in an array access: + +```yaml +pattern-sinks: + - patterns: + - pattern-inside: $ARRAY[$SINK] + - focus-metavariable: $SINK +``` + + +:::note +If you specify a sink such as `sink(...)`, then any tainted data passed to `sink`, through any of its arguments, results in a finding. 
+ + +::: + +## Custom propagators + +To better understand custom propagators, consider the following Python code where an unsafe `user_input` is stored in a `set` data structure. A random element from `set` is then passed into a `sink` function. This random element can be `user_input` itself, leading to an injection vulnerability. + + +```python +def test(s): + x = user_input + s = set([]) + s.add(x) + #ruleid: test + sink(s.pop()) +``` + +The following rule cannot find the above-described issue. The reason is that Semgrep is not aware that executing `s.add(x)` makes `x` one of the elements in the set data structure `s`. + +```yaml +mode: taint +pattern-sources: +- pattern: user_input +pattern-sinks: +- pattern: sink(...) +``` + +The use of **taint propagators** enables Semgrep to propagate taint in this scenario and others. + +Taint propagators are specified under the `pattern-propagators` key: + +```yaml +pattern-propagators: +- pattern: $S.add($E) + from: $E + to: $S +``` + +In the preceding example, Semgrep finds the pattern `$S.add($E)`, and it checks whether the code matched by `$E` is tainted. If it is tainted, Semgrep propagates that same taint to the code matched by `$S`. Thus, adding tainted data to a set marks the set itself as tainted. + + + +Note that `s` becomes tainted _by side effect_ after `s.add(x)`. This is due to `by-side-effect: true` being the default for propagators, and because `s` is an l-value. + +In general, a taint propagator must specify the following requirements: + +1. A pattern containing **two** metavariables. These two metavariables specify where taint is propagated **from** and **to**. +2. The `to` and `from` metavariables. These metavariables must match an **expression**. + - The `from` metavariable specifies the entry point of the taint. + - The `to` metavariable specifies where the tainted data is propagated to, typically an object or data structure. 
If option `by-side-effect` is enabled (as it is by default) and the `to` metavariable matches an l-value, the propagation is side-effectful. + +In the preceding example, pattern `$S.add($E)` includes two metavariables `$S` and `$E`. Given `from: $E`, `to: $S`, `$E` matching `x`, and `$S` matching `s`, when `x` is tainted, then `s` becomes tainted by side-effect with the same taint as `x`. + +Another situation where taint propagators are useful is specifying in Java that, when iterating a collection that is tainted, the individual elements must also be considered tainted: + + +```yaml +pattern-propagators: +- pattern: $C.forEach(($X) -> ...) + from: $C + to: $X +``` + +### Propagate without side-effect + +Taint propagators can be used in many different ways, and in some cases, you might not want taint to propagate by side effect. You can avoid this behavior by disabling `by-side-effect`, which is enabled by default. + + +```yaml +pattern-propagators: + - patterns: + - pattern: | + if something($FROM): + ... + $TO() + ... + from: $FROM + to: $TO + by-side-effect: false +``` + +The preceding propagator definition specifies that inside an `if` block, where the condition is `something($FROM)`, we want to propagate taint from `$FROM` to any function that is being called without arguments, `$TO()`. + + + +Because the rule turns off `by-side-effect`, the `sink` occurrence that is inside the `if` block is tainted, but this does not affect the `sink` occurrence outside the `if` block. + +## Minimize false positives + +The following [rule options](/writing-rules/rule-syntax/#options) can be used to minimize false positives: + +| Rule option | Default | Description | +| - | - | - | +| `taint_assume_safe_booleans` | `false` | Boolean data is never considered tainted (works better with type annotations). | +| `taint_assume_safe_numbers` | `false` | Numbers (integers, floats) are never considered tainted (works better with type annotations). 
| +| `taint_assume_safe_indexes` | `false` | An index expression `I` tainted does not make an access expression `E[I]` tainted (it is only tainted if `E` is tainted). | +| `taint_assume_safe_functions` | `false` | A function call like `F(E)` is not considered tainted even if `E` is tainted. Note: When using Pro's [interprocedural taint analysis](/writing-rules/data-flow/taint-mode/overview#interprocedural-analysis-), this only applies to functions for which Semgrep cannot find a definition. | +| `taint_only_propagate_through_assignments` 🧪 | `false` | Disables all implicit taint propagation except for assignments. | + +### Restrict taint by type 🧪 + +Semgrep automatically sanitizes Boolean expressions when it can infer that the expression resolves to a Boolean if you enable the `taint_assume_safe_booleans` option. + +For example, comparing a tainted string against a constant string isn't considered a tainted expression: + + + +Similarly, by enabling `taint_assume_safe_numbers`, Semgrep automatically sanitizes numeric expressions when it can infer that the expression is numeric. + + + +You could define explicit sanitizers that clean the taint from Boolean or numeric expressions, but these options are more convenient and also more efficient. + +:::note +Semgrep Pro's ability to infer types for expressions varies depending on the language. For example, in Python, type annotations are not always present, and the `+` operator can also be used to concatenate strings. Semgrep also ignores the types of functions and classes coming from third-party libraries. + + +::: + + +### Assume tainted indexes are safe + +By default, Semgrep assumes that accessing an array-like object with a tainted index (that is, `obj[tainted]`) is itself a tainted **expression**, even if the **object** itself is not tainted. Setting `taint_assume_safe_indexes: true` makes Semgrep assume that these expressions are safe. 
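A minimal sketch of a rule that enables this option might look as follows. The rule `id`, `message`, `severity`, and the `user_input` and `sink(...)` patterns are placeholders chosen for illustration and are not taken from any registry rule.

```yaml
rules:
  - id: tainted-index-example  # hypothetical rule id
    mode: taint
    languages: [python]
    severity: WARNING
    message: Tainted data reaches sink
    options:
      # With this option, `obj[user_input]` is tainted only if `obj` is tainted.
      taint_assume_safe_indexes: true
    pattern-sources:
      - pattern: user_input
    pattern-sinks:
      - pattern: sink(...)
```

With the option left at its default of `false`, the same rule would be expected to report `sink(obj[user_input])` even when `obj` itself is clean.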
 + + + +### Assume function calls are safe + +:::note +A function call is referred to as _opaque_ when Semgrep doesn't have access to its definition, which is necessary to examine it and determine its taint behavior. For example, with an opaque function, Semgrep cannot determine whether a function call propagates any taint that comes through its inputs. + +In Semgrep Community Edition (CE), where taint analysis is intraprocedural, all function calls are opaque. In Semgrep Pro, with [interprocedural taint analysis](/writing-rules/data-flow/taint-mode/overview#interprocedural-analysis-), an opaque function could originate from a third-party library. +::: + +By default, Semgrep assumes that an _opaque_ function call propagates any taint passed through any of its arguments to its output. + +For example, in the following code snippet, `some_safe_function` receives tainted data as input, so Semgrep assumes that it also returns tainted data as output. As a result, a finding is produced. + + +```javascript +var x = some_safe_function(tainted); +sink(x); // undesired finding here +``` + +This default behavior can generate false positives, and for certain rules on certain codebases, it produces a high level of noise. Setting `taint_assume_safe_functions: true` makes Semgrep assume that opaque function calls are safe and do not propagate any taint. If you want specific functions to propagate taint, you can do so using custom propagators: + + + +### Propagate only through assignments 🧪 + +Setting `taint_only_propagate_through_assignments: true` makes Semgrep propagate taint only through trivial assignments of the form `<l-value> = <tainted-expression>`. It requires the user to be explicit about any other kind of taint propagation that is to be performed. 
+ +For example, neither `unsafe_function(tainted)` nor `tainted_string + "foo"` will be considered tainted expressions: + + + +## Metavariables, rule messages, and unification + +The patterns specified by `pattern-sources` and `pattern-sinks` (and `pattern-sanitizers`) are all independent of each other. If a metavariable used in `pattern-sources` has the same name as a metavariable used in `pattern-sinks`, these are considered to be different metavariables. + +In the message of a taint-mode rule, you can refer to any metavariable bound by `pattern-sinks`, as well as to any metavariable bound by `pattern-sources` that does not conflict with a metavariable bound by `pattern-sinks`. + +Semgrep can also treat metavariables with the same name as the _same_ metavariable; to turn this behavior on, set `taint_unify_mvars: true` using rule `options`. Unification enforces the behavior where whatever a metavariable binds to in each of these operators is, syntactically speaking, the **same** piece of code. For example, if a metavariable binds to a code variable `x` in the source match, it must bind to the same code variable `x` in the sink match. In general, unless you know what you are doing, avoid metavariable unification between sources and sinks. + +The following example demonstrates the use of source and sink metavariable unification: + + + + +## Taint mode sensitivity + +### Field sensitivity + +The taint engine provides basic field sensitivity support. It can: +- Track that `x.a.b` is tainted, but `x` or `x.a` is **not** tainted. If `x.a.b` is tainted, any extension of `x.a.b` (such as `x.a.b.c`) is considered tainted by default. +- Track that `x.a` is tainted, but remember that `x.a.b` has been sanitized. Thus, the engine records that `x.a.b` is **not** tainted, but `x.a` or `x.a.c` are still tainted. + +:::note +The taint engine tracks taint **per variable**, *not* **per object in memory**. The taint engine does not track aliasing. 
+::: + + + +### Index sensitivity 🧪 + +:::note +Index sensitivity is a Semgrep Pro feature. +::: + +Semgrep Pro has basic index sensitivity support: + +- This feature is only for access using the built-in `a[E]` syntax. +- This feature works for _statically constant_ indexes that are integers, such as `a[42]` or strings, such as `a["foo"]`. +- If an arbitrary index `a[i]` is sanitized, then every index becomes clean of taint. + + + +## Report findings on the source 🧪 + +:::note +Reporting findings on the source of taint is a Semgrep Pro feature. +::: + +By default, Semgrep reports taint findings at the location of the sink being matched. You must examine the taint trace to identify the source of the taint. However, you can also have Semgrep report the findings at the location of the taint sources by setting the [rule-level option](/writing-rules/rule-syntax/#options) `taint_focus_on` to `source`: + +```yaml +options: + taint_focus_on: source +``` + + + +The [deduplication of findings](/writing-rules/data-flow/taint-mode/overview#deduplication-of-findings) still applies in this case. While Semgrep reports all the taint sources, the taint trace only informs you of one sink if a taint source can reach multiple sinks. + +## Restrict taint to at-exit sinks 🧪 + +:::note +At-exit taint sinks is a Semgrep Pro feature. +::: + +At-exit sinks are meant to facilitate writing leak-detection rules using taint mode. By setting `at-exit: true`, you can restrict a sink specification to only match at exit statements, or statements after which the control-flow will exit the function being analyzed. + +```yaml +pattern-sinks: +- pattern-either: + - pattern: return ... + - pattern: $F(...) + at-exit: true +``` + +The preceding sink pattern matches either `return` statements, which are always exit statements, or function calls occurring as exit statements. + +Unlike regular sinks, at-exit sinks trigger a finding if any tainted l-value reaches the location of the sink. 
For example, the preceding at-exit sink specification triggers a finding at a `return 0` statement if some tainted l-value reaches the `return`, even if `return 0` itself is not tainted. The location itself is the sink, rather than the code that is located there. + +You can use this behavior, for example, to check that file descriptors are being closed within the same function where they were opened. + + + +The `print(content)` statement is reported because the control flow exits the function at that point, and the file has not been closed. + +## Track control sources 🧪 + +:::note +Control taint sources is a Semgrep Pro feature. +::: + +Typically, taint analysis tracks the flow of tainted _data_, but taint sources can also track the flow of tainted _control_ by setting `control: true`. + +```yaml +pattern-sources: +- pattern: source(...) + control: true +``` + +This is useful for checking reachability, that is, to determine if control flow from a given code location can reach another code location, regardless of whether there is any data flow between them. In the following example, Semgrep checks whether `foo()` could be followed by `bar()`: + + + +By using a control source, you can define a context from which Semgrep detects if a call to some other code, such as a sink, can be reached. + +:::note +Use [taint labels](#taint-labels-) to combine both data and control sources in the same rule. +::: + +## Taint labels 🧪 + +Taint labels increase the expressiveness of taint analysis by allowing you to specify and track different kinds of tainted data in one rule using labels. This functionality is helpful for more complex use cases, such as when data becomes dangerous in several steps that are hard to specify through a single pair of source and sink. + + + +To include taint labels in a taint mode rule, follow these steps: + + +1. 
Attach a `label` key to the taint source, such as `label: TAINTED` or `label: INPUT`: + ```yaml + pattern-sources: + - pattern: user_input + label: INPUT + ``` + Semgrep accepts any valid Python identifier as a label. + +2. Restrict a taint source to a subset of labels using the `requires` key. The following sample extends the previous example with `requires: INPUT`: + ```yaml + pattern-sources: + - pattern: user_input + label: INPUT + - pattern: evil(...) + requires: INPUT + label: EVIL + ``` + Combine labels using the `requires` key. To do so, use Python's Boolean operators, such as `requires: LABEL1 and not LABEL2`. + +3. Use the `requires` key to restrict a taint sink in the same way as a source: + ```yaml + pattern-sinks: + - pattern: sink(...) + requires: EVIL + ``` + The extra taint is only produced if the source itself is tainted and satisfies the `requires` formula. + +In the following example, assume that `user_input` is dangerous, but only when it passes through the `evil` function. This can be specified with taint labels as follows: + + + + + + +### Multiple `requires` expressions in taint labels + +You can assign an independent `requires` expression to each metavariable matched by a sink. Given `$OBJ.foo($SINK, $ARG)`, you can require that `$SINK` has label `BAD`, `$OBJ` has label `AAA`, and `$ARG` has label `BBB`, while focusing the sink on `$SINK` with `focus-metavariable: $SINK`: + +```yaml +pattern-sinks: + - patterns: + - pattern: $OBJ.foo($SINK, $ARG) + - focus-metavariable: $SINK + requires: + - $SINK: BAD + - $OBJ: AAA + - $ARG: BBB +``` diff --git a/docs/writing-rules/data-flow/taint-mode/overview.md b/docs/writing-rules/data-flow/taint-mode/overview.md new file mode 100644 index 0000000000..db0b0d35da --- /dev/null +++ b/docs/writing-rules/data-flow/taint-mode/overview.md @@ -0,0 +1,300 @@ +--- +slug: overview +title: Taint analysis +hide_title: true +description: Learn about taint mode, which allows you to write rules that catch complex injection bugs using taint analysis. 
+tags: + - Rule writing + - Dataflow analysis + - Taint analysis +--- + +# Taint analysis overview + +Semgrep supports [taint analysis](https://en.wikipedia.org/wiki/Taint_checking), also known as taint tracking, through taint rules. Taint rules are specified by the inclusion of `mode: taint` in your rule. + +Taint analysis is a dataflow analysis that tracks the flow of untrusted, or **tainted**, data throughout the body of a function or method. Tainted data originates from tainted **sources**. If tainted data is not transformed or checked accordingly, or **sanitized**, taint analysis reports a finding whenever tainted data reaches a vulnerable function, called a **sink**. Tainted data flows from sources to sinks through **propagators**, such as assignments and function calls. + + + +## Create a rule + +To create a taint tracking rule, include `mode: taint` in the rule's YAML definition file. This enables the following operators: + +| Operator | Required? | +| - | - | +| `pattern-sources` | Yes | +| `pattern-propagators` | No | +| `pattern-sanitizers` | No | +| `pattern-sinks` | Yes | + +These operators, which act as `pattern-either` operators, take a list of patterns that specify what is considered a source, a propagator, a sanitizer, or a sink. + +> You can use **any** pattern operator and you have the same expressive power as you would with a `mode: search` rule. + +### Sample rule and pattern matching + + + +In the preceding example, Semgrep tracks the data returned by `get_user_input()`, which is the source of tainted data. You can think of what's happening as Semgrep running the pattern `get_user_input(...)` on your code, identifying all instances where `get_user_input` is called, and labeling them as tainted. + +The rule specifies the sanitizer `sanitize_input(...)`, so any expression that matches that pattern is considered sanitized. In particular, the expression `sanitize_input(data)` is labeled as sanitized. 
Even if `data` is tainted, as it occurs inside a piece of sanitized code, it does not produce any findings. + +Finally, the rule specifies that anything matching either `html_output(...)` or `eval(...)` should be regarded as a sink. There are two calls to `html_output(data)` that are both labeled as sinks. The first one in `route1` is not reported because `data` is sanitized before reaching the sink, whereas the second one in `route2` is reported because the `data` that reaches the sink is still tainted. + +Find more examples of taint rules in the [Semgrep Registry](https://semgrep.dev/r?owasp=injection%2Cxss), including [express-sandbox-code-injection](https://semgrep.dev/editor?registry=javascript.express.security.express-sandbox-injection.express-sandbox-code-injection). + +:::warning +[Metavariables](/writing-rules/pattern-syntax#metavariables) used in `pattern-sources` are considered _different_ from those used in `pattern-sinks`, even if they have the same name! See [Metavariables, rule message, and unification](/writing-rules/data-flow/taint-mode/advanced#metavariables-rule-messages-and-unification) for further details. +::: + +## Sources + +You can specify a taint source using a pattern. Like a search-mode rule, you can start this pattern with one of the following keys: + +- `pattern` +- `patterns` +- `pattern-either` +- `pattern-regex` + +Example: + +```yaml +pattern-sources: +- pattern: source(...) +``` + +**Any** subexpression that's matched by the pattern you define is regarded as a source of tainted data. + +Additionally, taint sources accept the following options: + +| Option | Type | Default | Description | +| - | - | - | - | +| `exact` | {`false`, `true`} | `false` | See [Exact sources](#exact-sources). | +| `by-side-effect` | {`false`, `true`, `only`} | `false` | See [Taint sources by side-effect](/writing-rules/data-flow/taint-mode/advanced#taint-sources-by-side-effect). 
| +| `control` (Pro) 🧪 | {`false`, `true`} | `false` | See [Track control sources](/writing-rules/data-flow/taint-mode/advanced#track-control-sources-). | + +### Exact sources + +Given the following source specification and a piece of code, such as `source(sink(x))`, the call `sink(x)` is reported as a tainted sink. + +```yaml +pattern-sources: +- pattern: source(...) +``` + +The reason is that the pattern `source(...)` matches all of `source(sink(x))`, and that makes Semgrep consider every subexpression in that piece of code as being a source. In particular, `x` is a source, and it is being passed into `sink`. + + + +You can instruct Semgrep to only consider as taint sources the "exact" matches of a source pattern by setting `exact: true`: + +```yaml +pattern-sources: +- pattern: source(...) + exact: true +``` + +Once the source is exact, Semgrep no longer considers subexpressions as taint sources, and `sink(x)` inside `source(sink(x))` isn't reported as a tainted sink, unless `x` is tainted in another way. + + + +For many rules, this distinction isn't meaningful because it doesn't always make sense that a sink occurs inside the arguments of a source function. + +> If one of your rules relies on non-exact matching of sources, make this fact explicit with `exact: false`, even if it is the current default, so that your rule doesn't break if the default changes. + +## Sanitizers + +You can specify a taint sanitizer using a pattern. Like a search-mode rule, you can start the pattern with any of the following keys: + +- `pattern` +- `patterns` +- `pattern-either` +- `pattern-regex` + +Example: + +```yaml +pattern-sanitizers: +- pattern: sanitize(...) +``` + +**Any** subexpression that is matched by this pattern is regarded as sanitized. + +Additionally, taint sanitizers accept the following options: + + +| Option | Type | Default | Description | +| - | - | - | - | +| `exact` | {`false`, `true`} | `false` | See [Exact sanitizers](#exact-sanitizers). 
| +| `by-side-effect` | {`false`, `true`, `only`} | `false` | See [Taint sanitizers by side-effect](/writing-rules/data-flow/taint-mode/advanced#taint-sanitizers-by-side-effect). | + +### Exact sanitizers + +Given the sanitizer specification that follows and a piece of code, such as `sanitize(sink("taint"))`, Semgrep doesn't report the call `sink("taint")`. + +```yaml +pattern-sanitizers: +- pattern: sanitize(...) +``` + +This is because the pattern `sanitize(...)` matches all of `sanitize(sink("taint"))`, and that makes Semgrep consider every subexpression in that piece of code as sanitized. In particular, `"taint"` is considered sanitized. + + + +You can instruct Semgrep only to consider the exact matches of a sanitizer pattern as sanitized by setting `exact: true`: + + +```yaml +pattern-sanitizers: +- pattern: sanitize(...) + exact: true +``` + +Once the sanitizer is exact, Semgrep no longer considers subexpressions as sanitized, and `sink("taint")` inside `sanitize(sink("taint"))` is reported as a tainted sink. + + + +For many rules, this distinction isn't meaningful, because it does not always make sense that a sink occurs inside the arguments of a sanitizer function. + +:::note +If any of your rules rely on non-exact matches, make this explicit by setting `exact: false` in your rule definition, even if this is the default setting. This ensures that your rule doesn't break if the default changes. +::: + +## Sinks + +You can specify a taint sink using a pattern. Like a search-mode rule, you can start this pattern with one of the following keys: + +- `pattern` +- `patterns` +- `pattern-either` +- `pattern-regex` + +Unlike sources and sanitizers, Semgrep doesn't consider the subexpressions of the matched expressions as sinks by default. + +Example: + +```yaml +pattern-sinks: +- pattern: sink(...) 
+``` + +Additionally, taint sinks accept the following options: + +| Option | Type | Default | Description | +| - | - | - | - | +| `exact` | {`false`, `true`} | `true` | See [Non-exact sinks](#non-exact-sinks). | +| `at-exit` (Pro) 🧪 | {`false`, `true`} | `false` | See [Restrict taint to at-exit sinks](/writing-rules/data-flow/taint-mode/advanced#restrict-taint-to-at-exit-sinks-). | + +### Non-exact sinks + +Given the following sink specification and a piece of code, such as `sink("foo" if tainted else "bar")`, Semgrep doesn't report the code as a tainted sink. + + +```yaml +pattern-sinks: +- pattern: sink(...) +``` + +This is because Semgrep considers the sink to be the argument of the `sink` function, and the actual argument being passed is `"foo" if tainted else "bar"`, which evaluates to either `"foo"` or `"bar"`. Neither of these values is tainted. + + + +You can instruct Semgrep to consider any of the subexpressions matching the sink pattern a taint sink by setting `exact: false`: + +```yaml +pattern-sinks: +- pattern: sink(...) + exact: false +``` + +Once the sink is non-exact, Semgrep considers subexpressions as taint sinks, and `tainted` inside `sink("foo" if tainted else "bar")` is now reported as a tainted sink. + + + +## Findings + +Taint findings are accompanied by a taint trace that explains how the taint flows from source to sink. + + + +### Deduplication of findings + +Semgrep tracks all possible ways that taint can reach a sink, but it only reports one taint trace, not all the possible options. You can use the following example to visualize this behavior: + +1. Click **Open in Playground**. +2. Run the example. Semgrep returns one match. +3. Expand the **Matches** section, and click **dataflow**. 
+ +Note that, even though `sink` can be tainted via `x` or via `y`, the trace will only show you one of these possibilities. If you replace `x = user_input` with `x = "safe"`, then Semgrep reports the taint trace via `y`. + + + +## Propagators 🧪 + +:::note +Custom taint propagators is a Semgrep Pro feature. +::: + +By default, tainted data automatically propagates through assignments, operators, and function calls (from inputs to output). However, there are other ways in which taint can propagate, but this requires language or library-specific knowledge that Semgrep does not have built in. + +You can define a taint propagator by specifying a pattern. Like search-mode rules, you can start this pattern with any of the following keys: + +- `pattern` +- `patterns` +- `pattern-either` +- `pattern-regex` + +A propagator also needs to specify the origin (`from`) and the destination (`to`) of the taint to be propagated. + +| Field | Type | Description | +| - | - | - | +| `from` | metavariable | Source of propagation | +| `to` | metavariable | Destination of propagation | + +In addition, taint propagators accept the following options: + +| Option | Type | Default | Description | +| - | - | - | - | +| `by-side-effect` | {`false`, `true`} | `true` | See [Propagate without side-effect](/writing-rules/data-flow/taint-mode/advanced#propagate-without-side-effect). | + +For example, given the following propagator, if taint goes into the second argument of `strcpy`, its first argument gets the same taint: + + +```yaml +pattern-propagators: +- pattern: strcpy($DST, $SRC) + from: $SRC + to: $DST +``` + +:::info +Taint propagators only work intraprocedurally, that is, within a function or method. You cannot use taint propagators to propagate taint across different functions/methods. For that, use [interprocedural analysis](#interprocedural-analysis-). +::: + +## Interprocedural analysis 🧪 + +:::info +Interprocedural taint analysis is a Semgrep Pro feature. 
+::: + +[Semgrep](/semgrep-pro-vs-oss/) can perform interprocedural taint analysis, that is, track taint across multiple functions. + +In the following example, `user_input` is passed to `foo` as input, and from there, flows to the sink at line 3 through a call chain involving three functions. Semgrep can track this flow and report the sink as tainted. Semgrep also provides an interprocedural taint trace that explains how exactly `user_input` reaches the `sink(z)` statement. To see this, click **Open in Playground**, then find the **Matches** panel and click **dataflow**. + + + +Using the CLI option `--pro-intrafile` when invoking Semgrep, Semgrep performs interprocedural (across functions), _intra_-file (within one file) analysis. In other words, Semgrep tracks taint across functions, but it will not cross file boundaries. This is supported for essentially every language, and performance is very close to that of intraprocedural taint analysis. + +Using the CLI option `--pro`, Semgrep will perform interprocedural (across functions) as well as *inter*-file (across files) analysis. Inter-file analysis is only supported for [a subset of languages](/supported-languages#language-maturity-summary). For a rule to run interfile, it also needs to set `interfile: true`: + +```yaml +options: + interfile: true +``` + +### Memory requirements for inter-file analysis + +While interfile analysis is more powerful, it also demands more memory resources. The Semgrep team advises a minimum of 4 GB of memory per core, but **recommends 8 GB per core or more**. The specific amount of memory needed depends on the codebase and on the number of interfile rules being run. diff --git a/docs/writing-rules/glossary.md b/docs/writing-rules/glossary.md index e9dbfd8818..60599a37dc 100644 --- a/docs/writing-rules/glossary.md +++ b/docs/writing-rules/glossary.md @@ -119,7 +119,7 @@ There are two types of rules: **search** and **taint**.
Taint rules
-
Taint rules make use of Semgrep's taint analysis in addition to default search functionalities. Taint rules are able to specify sources, sinks, and propagators of data as well as sanitizers of that data. For more information, see Taint analysis documentation.
+
Taint rules make use of Semgrep's taint analysis in addition to default search functionalities. Taint rules are able to specify sources, sinks, and propagators of data as well as sanitizers of that data. For more information, see Taint analysis documentation.
diff --git a/docs/writing-rules/pattern-syntax.mdx b/docs/writing-rules/pattern-syntax.mdx index d83056990c..c370ac920f 100644 --- a/docs/writing-rules/pattern-syntax.mdx +++ b/docs/writing-rules/pattern-syntax.mdx @@ -563,7 +563,7 @@ For search mode rules, metavariables with the same name are treated as the same For taint mode rules, patterns defined **within** `pattern-sinks` and `pattern-sources` still unify. However, metavariable unification **between** `pattern-sinks` and `pattern-sources` is **not** enabled by default. -To enforce unification, set `taint_unify_mvars: true` under the rule `options` key. When `taint_unify_mvars: true` is set, a metavariable defined in `pattern-sinks` and `pattern-sources` with the same name is treated as the same metavariable. See [Metavariables, rule message, and unification](/writing-rules/data-flow/taint-mode#metavariables-rule-message-and-unification) for more information. +To enforce unification, set `taint_unify_mvars: true` under the rule `options` key. When `taint_unify_mvars: true` is set, a metavariable defined in `pattern-sinks` and `pattern-sources` with the same name is treated as the same metavariable. See [Metavariables, rule message, and unification](/writing-rules/data-flow/taint-mode/advanced#metavariables-rule-messages-and-unification) for more information. 
### Display matched metavariables in rule messages diff --git a/docusaurus.config.js b/docusaurus.config.js index 8866eb0cfb..575ccdc1ab 100644 --- a/docusaurus.config.js +++ b/docusaurus.config.js @@ -338,9 +338,9 @@ module.exports = { { from: "/experiments/extract-mode/", to: "/writing-rules/experiments/deprecated-experiments" }, { from: "/experiments/r2c-internal-project-depends-on/", to: "/writing-rules/experiments/r2c-internal-project-depends-on" }, { from: "/experiments/symbolic-propagation/", to: "/writing-rules/experiments/symbolic-propagation" }, - { from: "/experiments/taint-propagators/", to: "/writing-rules/data-flow/taint-mode" }, - { from: "/writing-rules/experiments/taint-propagators/", to: "/writing-rules/data-flow/taint-mode" }, - { from: "/experiments/taint-labels/", to: "/writing-rules/data-flow/taint-mode" }, + { from: "/experiments/taint-propagators/", to: "/writing-rules/data-flow/taint-mode/overview" }, + { from: "/writing-rules/experiments/taint-propagators/", to: "/writing-rules/data-flow/taint-mode/overview" }, + { from: "/experiments/taint-labels/", to: "/writing-rules/data-flow/taint-mode/overview" }, { from: "/experiments/metavariable-analysis/", to: "/writing-rules/metavariable-analysis" }, { from: "/experiments/multiple-focus-metavariables/", to: "/writing-rules/experiments/multiple-focus-metavariables" }, { from: "/experiments/display-propagated-metavariable/", to: "/writing-rules/experiments/display-propagated-metavariable" }, @@ -482,6 +482,9 @@ module.exports = { { from: "/kb/semgrep-appsec-platform/find-specific-findings" , to: "/kb/semgrep-appsec-platform/search-filter-sort-findings" }, /* JUL 25, 2025 */ { from: "/semgrep-supply-chain/upgrade-guidance" , to: "/semgrep-supply-chain/triage-and-remediation" }, + /* OCT 3, 2025 */ + { from: "/writing-rules/data-flow/taint-mode", to: "/writing-rules/data-flow/taint-mode/overview" }, + ] } ], diff --git a/release-notes/april-2022.md b/release-notes/april-2022.md index 
98b5f07048..d12b2d3289 100644 --- a/release-notes/april-2022.md +++ b/release-notes/april-2022.md @@ -45,7 +45,7 @@ These release notes encompass upgrades for all versions ranging between **0.87.0 ### Breaking changes -- taint-mode: Unification of metavariables between sources and sinks is no longer enforced by default. It was not clear that this is the most natural behavior as it was confusing even for experienced Semgrep users. Instead, each set of metavariables is now considered independent by Semgrep. The metavariables available to the rule message are all metavariables bound by `pattern-sinks`, and the subset of metavariables bound by `pattern-sources` that do not collide with the ones bound by `pattern-sinks`. We do not expect this change to break many taint rules because source-sink metavariable unification had a bug (see [#4464](https://github.com/semgrep/semgrep/issues/4453)) that prevented metavariables bound by a `pattern-inside` to be unified, thus limiting the usefulness of the feature. Nonetheless, it is still possible to enforce metavariable unification by setting `taint_unify_mvars: true` in the rule options. For more information, see section [Metavariables, rule message, and unification](/writing-rules/data-flow/taint-mode/#metavariables-rule-message-and-unification). +- taint-mode: Unification of metavariables between sources and sinks is no longer enforced by default. It was not clear that this is the most natural behavior as it was confusing even for experienced Semgrep users. Instead, each set of metavariables is now considered independent by Semgrep. The metavariables available to the rule message are all metavariables bound by `pattern-sinks`, and the subset of metavariables bound by `pattern-sources` that do not collide with the ones bound by `pattern-sinks`. 
We do not expect this change to break many taint rules because source-sink metavariable unification had a bug (see [#4464](https://github.com/semgrep/semgrep/issues/4453)) that prevented metavariables bound by a `pattern-inside` to be unified, thus limiting the usefulness of the feature. Nonetheless, it is still possible to enforce metavariable unification by setting `taint_unify_mvars: true` in the rule options. For more information, see section [Metavariables, rule message, and unification](/writing-rules/data-flow/taint-mode/advanced#metavariables-rule-messages-and-unification). - The `semgrep/semgrep` Docker image no longer sets `semgrep` as the entry point. This means that Semgrep is no longer prepended automatically to any command you run in the image. This makes it possible to use the image in CI executors that run provisioning commands within the image. Affected users may receive a deprecation notice. Adjust scripts accordingly. ### Additions diff --git a/release-notes/august-2022.md b/release-notes/august-2022.md index 6f24ccc401..5e8aa3a756 100644 --- a/release-notes/august-2022.md +++ b/release-notes/august-2022.md @@ -56,7 +56,7 @@ Minor bug fixes are not included in the release notes unless they are potentiall - Consistent and exhaustive documentation about continuous integration (CI) both with and without Semgrep App: - [Running Semgrep in continuous integration (CI) with Semgrep App](/deployment/core-deployment) - [Running Semgrep in continuous integration (CI) without Semgrep App](/deployment/oss-deployment) -- Experimental taint propagators allow you to specify additional structures through which taint propagates. See how to use them in the [Propagators](/writing-rules/data-flow/taint-mode#propagators-pro) section. +- Experimental taint propagators allow you to specify additional structures through which taint propagates. See how to use them in the [Propagators](/writing-rules/data-flow/taint-mode/overview#propagators-) section. 
- Updated [Generic pattern matching](/writing-rules/generic-pattern-matching) documentation, rewritten examples, and added new sections, including a new [Handling line-based input](/writing-rules/generic-pattern-matching/#handling-line-based-input) section. - Introduced interface and color changes to fit new [semgrep.dev](https://semgrep.dev/) website design. - Report vulnerabilities that Semgrep should have found, but did not. You can report these false negatives directly from your command-line using a built-in Semgrep flag. See [Reporting false negatives with shouldafound](/reporting-false-negatives) article. diff --git a/release-notes/december-2022.md b/release-notes/december-2022.md index 91211b242d..badb2e87e7 100644 --- a/release-notes/december-2022.md +++ b/release-notes/december-2022.md @@ -71,5 +71,5 @@ These release notes include upgrades for versions ranging between 1.0.0 and 1.2. - [Autofix](/writing-rules/autofix) - [Generic pattern matching](/writing-rules/generic-pattern-matching) - [Metavariable analysis](/writing-rules/metavariable-analysis) - - Taint propagators - moved to [Taint tracking](/writing-rules/data-flow/taint-mode#propagators-pro) documentation + - Taint propagators - moved to [Taint tracking](/writing-rules/data-flow/taint-mode/overview#propagators-) documentation - Updated screenshots in Semgrep App documentation. Many additional improvements and fixes were made. diff --git a/release-notes/november-2022.md b/release-notes/november-2022.md index 1bd05ea0d2..7d1c9b8a11 100644 --- a/release-notes/november-2022.md +++ b/release-notes/november-2022.md @@ -54,7 +54,7 @@ These release notes include upgrades for versions ranging between 0.120.0 and 0. ### Changes -- taint-mode: Semgrep’s taint analysis now provides basic field sensitivity support. See [Field sensitivity](/writing-rules/data-flow/taint-mode/#field-sensitivity) section for more details. +- taint-mode: Semgrep’s taint analysis now provides basic field sensitivity support. 
See [Field sensitivity](/writing-rules/data-flow/taint-mode/advanced#taint-mode-sensitivity) section for more details. ## Semgrep in CI @@ -89,7 +89,7 @@ These release notes include upgrades for versions ranging between 0.120.0 and 0. #### Semgrep CLI - The [Experiments](/writing-rules/experiments/introduction) category now provides an introduction, in addition, the [Deprecated experiments](/writing-rules/experiments/deprecated-experiments) section is now an independent document. -- Added [Field sensitivity](/writing-rules/data-flow/taint-mode/#field-sensitivity) section to taint analysis documentation. +- Added [Field sensitivity](/writing-rules/data-flow/taint-mode/advanced#taint-mode-sensitivity) section to taint analysis documentation. ### Changes diff --git a/release-notes/october-2022.md b/release-notes/october-2022.md index 271ff731e8..deda7cd088 100644 --- a/release-notes/october-2022.md +++ b/release-notes/october-2022.md @@ -48,7 +48,7 @@ These release notes include upgrades for versions ranging between 0.116.0 and 0. - Taint mode now tracks taint coming from the default values of function parameters. For example, given `def test(url = "http://example.com"):`, if `"http://example.com"` is a taint source (as a consequence of not using TLS protocol), then `url` is marked as tainted. (Issue [#6298](https://github.com/semgrep/semgrep/issues/6298)) - Two new rule options that help to minimize false positives: - The`taint_assume_safe_indexes`, which makes Semgrep assume that an array-access expression is safe even if the index expression is tainted. Otherwise Semgrep assumes that for example: `a[i]` is tainted if `i` is tainted, even if `a` is not. Enabling this option is recommended for high-signal rules, whereas disabling is preferred for audit rules. Currently, it is disabled by default to attain backwards compatibility, but this can change in the near future after some evaluation. 
To enable this option, include the `taint_assume_safe_indexes: true` under the `options` key. For more information, see [Rule syntax](/writing-rules/rule-syntax/#options) documentation. (PR [#6327](https://github.com/semgrep/semgrep/pull/6327)) - - The `taint_assume_safe_functions`, makes Semgrep assume that function calls do **not** propagate taint from their arguments to their output. Otherwise, Semgrep always assumes that functions may propagate taint. This is intended to replace **not-conflicting** sanitizers (added in v0.69.0, for more information, see [Minimizing false positives via sanitizers](/writing-rules/data-flow/taint-mode#sanitizers)) in the future. This option is still experimental and needs to be complemented by other changes in future releases. To enable this option, include the `taint_assume_safe_functions: true` under the `options` key. For more information, see [Rule syntax](/writing-rules/rule-syntax/#options) documentation. (PR [#6327](https://github.com/semgrep/semgrep/pull/6327)) + - The `taint_assume_safe_functions`, makes Semgrep assume that function calls do **not** propagate taint from their arguments to their output. Otherwise, Semgrep always assumes that functions may propagate taint. This is intended to replace **not-conflicting** sanitizers (added in v0.69.0, for more information, see [Minimizing false positives via sanitizers](/writing-rules/data-flow/taint-mode/overview#sanitizers)) in the future. This option is still experimental and needs to be complemented by other changes in future releases. To enable this option, include the `taint_assume_safe_functions: true` under the `options` key. For more information, see [Rule syntax](/writing-rules/rule-syntax/#options) documentation. (PR [#6327](https://github.com/semgrep/semgrep/pull/6327)) - It is now possible to use `pattern-propagators` to propagate taint through higher-order iterators such as `forEach` in Java. 
For example: ```yaml diff --git a/release-notes/september-2022.md b/release-notes/september-2022.md index cd88d492b0..961db0ef31 100644 --- a/release-notes/september-2022.md +++ b/release-notes/september-2022.md @@ -59,13 +59,13 @@ Minor bug fixes are not included in the release notes unless they are potentiall ## Documentation updates -- New documentation for experimental [Taint labels](/writing-rules/data-flow/taint-mode#taint-labels-pro-). +- New documentation for experimental [Taint labels](/writing-rules/data-flow/taint-mode/advanced#taint-labels-). - New documentation for [Display matched metavariables in rule messages](/writing-rules/pattern-syntax/#display-matched-metavariables-in-rule-messages) and experimental [Displaying propagated value of metavariables](/writing-rules/experiments/display-propagated-metavariable). - New documentation for [Using multiple focus metavariables](/writing-rules/experiments/multiple-focus-metavariables). - Added information about [Ellipsis operator scope](/writing-rules/pattern-syntax/#ellipsis-operator-scope). - Many documents, such as [Getting started with Semgrep App](/deployment/core-deployment) now display minimal Semgrep tier required for a particular feature documented on the page. - Updated [Managing findings in Semgrep App](/semgrep-code/findings). -- [Taint mode](/writing-rules/data-flow/taint-mode) documentation has been updated and now includes introductory video. +- [Taint mode](/writing-rules/data-flow/taint-mode/overview) documentation has been updated and now includes introductory video. - Updated [Getting started with Semgrep in continuous integration (CI)](/deployment/core-deployment) - Updated [Data-flow analysis engine overview](/writing-rules/data-flow/data-flow-overview). - Updated [Integrating Semgrep into source code management (SCM) tools](/deployment/connect-scm). 
diff --git a/sidebars.js b/sidebars.js index 4a8bd4ba56..1f373a3902 100644 --- a/sidebars.js +++ b/sidebars.js @@ -453,9 +453,17 @@ module.exports = { type: 'category', label: 'Dataflow analysis', link: {type: 'doc', id: 'writing-rules/data-flow/data-flow-overview'}, + collapsible: false, items: [ 'writing-rules/data-flow/constant-propagation', - 'writing-rules/data-flow/taint-mode', + { + type: 'category', + label: 'Taint analysis', + link: {type: 'doc', id: 'writing-rules/data-flow/taint-mode/overview'}, + items: [ + 'writing-rules/data-flow/taint-mode/advanced' + ] + }, 'writing-rules/data-flow/status' ] }, diff --git a/src/components/concept/_semgrep-code-display-tainted-data.mdx b/src/components/concept/_semgrep-code-display-tainted-data.mdx index 819be7052a..2b86f24867 100644 --- a/src/components/concept/_semgrep-code-display-tainted-data.mdx +++ b/src/components/concept/_semgrep-code-display-tainted-data.mdx @@ -2,7 +2,7 @@ With **dataflow traces**, Semgrep Code can provide you with a visualization of the path of tainted, or untrusted, data in specific findings. This path can help you track the sources and sinks of the tainted data as they propagate through the body of a function or a method. For general information about taint -analysis, see [Taint tracking](/writing-rules/data-flow/taint-mode). +analysis, see [Taint tracking](/writing-rules/data-flow/taint-mode/overview). When running Semgrep Code from the command line, you can pass in the flag `--dataflow-traces` to use this feature.