Introduction to Semgrep
Semgrep is an open-source static analysis tool that helps catch security vulnerabilities, misconfigurations, and deviations from standard practices. You can write rules to detect specific or generic issues, whether they are security-related or not. It supports more than 20 languages and is used by companies such as Slack and UXCam.
It can be installed on Linux and macOS, or run through Docker. It can also be integrated into a CI/CD pipeline, which allows you to scan for vulnerabilities before each build. For example, with Homebrew or pip:
brew install semgrep
python3 -m pip install semgrep
Once installed, you can run it with:
semgrep --config auto /path/to/repo
Writing custom rules
Writing Semgrep rules is not rocket science and does not require any deep background knowledge. Rules are written in YAML, and Semgrep provides an interactive way to write and test them. You can try this out. Consider the following simple rule, which catches the use of print statements in Python.
Please focus on pattern and languages. “languages” names the programming language(s) the rule targets, and “pattern” is the code shape Semgrep searches for. Here, it matches every print statement that has a string as its argument. If you change the pattern to print(...), it also matches the remaining print statements, regardless of the type of argument.
We can see that the rule has matched all print statements. If you take a closer look, you will notice the “...” operator, which Semgrep calls the ellipsis operator. It matches any sequence of zero or more items, such as arguments or statements. For example, consider the following rule.
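A minimal sketch of such a rule might look like this. The rule id, severity, and message are assumptions for illustration; only the pattern and the target language follow from the description.

```yaml
rules:
  - id: python-print-string    # hypothetical rule id
    languages: [python]        # language of the code being scanned
    severity: INFO             # assumed severity
    message: Semgrep found a match
    # "..." inside quotes stands for any string literal, so this
    # matches print("hello") but not print(42).
    pattern: print("...")
```

Saved under a name such as print-rule.yaml (a hypothetical file name), it could be run with semgrep --config print-rule.yaml /path/to/repo.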
message: Semgrep found a match
This rule matches print("Hi"), print("hello"), and anything between them.
It matches everything from line 5 all the way to line 10 but does not match line 12 and line 14.
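The rule described above can be sketched roughly as follows. The message and the two print statements come from the text; the rule id and severity are assumptions chosen for illustration.

```yaml
rules:
  - id: between-hi-and-hello   # hypothetical rule id
    languages: [python]
    severity: INFO             # assumed severity
    message: Semgrep found a match
    # A "..." on its own line stands for any sequence of statements
    # (including none) between the two print calls.
    pattern: |
      print("Hi")
      ...
      print("hello")
```

Statements that come after print("hello") fall outside the pattern, which is why later lines are not reported.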
Let’s try matching functions that are named either “this” or “that”.
Notice the use of pattern-either. It lets you list multiple patterns, and a match on any one of them is enough. In the example above, it matches the functions that are named “this” or “that”. We can also match any function regardless of its name.
Notice the use of $F in the pattern. Here it means the function name can be anything. As we don’t know the name of the function, we can reference it with $ followed by an uppercase name. We could also use $A or $B in the example above. This is what Semgrep calls a metavariable. To make it clearer, consider the following rule.
Here, the pattern matches code that takes input directly from the user as a GET parameter and passes it into eval. We don’t know the names in advance, such as data in the example above. Also, the code might take input from either the “GET” or the “POST” method. That is why we use the $Y metavariable.
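One way to sketch this, assuming the goal is to match function definitions (rather than calls) and using a hypothetical rule id:

```yaml
rules:
  - id: function-named-this-or-that   # hypothetical rule id
    languages: [python]
    severity: INFO                     # assumed severity
    message: Semgrep found a match
    # pattern-either is a logical OR: a finding is reported when
    # any one of the listed patterns matches.
    pattern-either:
      - pattern: |
          def this(...):
              ...
      - pattern: |
          def that(...):
              ...
```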
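Matching any function, whatever its name, could look roughly like this. The rule id and message are illustrative assumptions.

```yaml
rules:
  - id: any-function-definition   # hypothetical rule id
    languages: [python]
    severity: INFO                 # assumed severity
    # Metavariables can be reused in the message; $F is replaced
    # with the name of the matched function.
    message: Found a function named $F
    # $F binds to whatever name the function has.
    pattern: |
      def $F(...):
          ...
```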
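As a rough sketch, assuming Django-style request.GET and request.POST objects and a hypothetical rule id, such a rule might be written as:

```yaml
rules:
  - id: eval-on-request-data   # hypothetical rule id
    languages: [python]
    severity: ERROR             # assumed severity
    message: User input from request.$Y is passed to eval
    # $Y stands for the attribute on the request object (GET, POST, ...),
    # and "..." stands for whatever parameter name is requested.
    pattern: eval(request.$Y.get(...))
```

Because $Y can bind to any attribute name, this sketch is broader than GET and POST alone. A metavariable-regex clause on $Y would be one way to narrow it.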
Let’s write a simple rule that catches missing authentication in API endpoints.
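A minimal sketch of this idea follows. It assumes Django-style request.GET and request.POST access and a bare @authorize decorator (no arguments); the rule id and message are hypothetical.

```yaml
rules:
  - id: missing-authorize   # hypothetical rule id
    languages: [python]
    severity: WARNING        # assumed severity
    message: request.$Y is read in a function without @authorize
    patterns:
      # Find every place where user input is read...
      - pattern: request.$Y.get(...)
      # ...and keep only those that are not inside a function
      # decorated with @authorize.
      - pattern-not-inside: |
          @authorize
          def $F(...):
              ...
```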
This rule catches API endpoints that don’t have @authorize before the function definition. It looks for anything that accepts user input from “GET” or “POST” parameters and checks whether @authorize is present. Two operators are used: pattern and pattern-not-inside. “pattern” finds the code of interest, and “pattern-not-inside” discards any finding located inside code that matches its own pattern. In the example above, that leaves only the code where @authorize is missing. Running this rule can reveal API endpoints with missing authentication.
Now that we have some basics covered, let’s write a simple rule that can catch command injection vulnerabilities in Python. We cover two scenarios:
User input is directly passed into os.system()
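A sketch for this first scenario, again assuming Django-style request.GET and request.POST and a hypothetical rule id:

```yaml
rules:
  - id: os-system-direct-input   # hypothetical rule id
    languages: [python]
    severity: ERROR               # assumed severity
    message: User input from request.$Y is passed directly to os.system
    # Matches only when the request lookup appears inside the call itself.
    pattern: os.system(request.$Y.get(...))
```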
The rule catches user input passed directly into “os.system” but fails to catch cases where the input is first assigned to a variable and the variable is passed instead.
User input is referenced from a variable and passed into os.system()
For this, we use the “pattern-either” operator to combine both cases.
There are various ways in which command injection can occur. I showed just two for the sake of example. There are already plenty of community-built rules for generic vulnerabilities. The real power of Semgrep lies in writing rules tailored to your own organization.
Semgrep is a powerful static analysis tool that you can use to protect your codebase from different kinds of security vulnerabilities. How you use it depends on your needs. For example, if you run a bug bounty program and receive a submission, you can write a rule to find similar vulnerabilities across your codebase before others find them. Every organization has its own coding and development patterns. Once we understand them, we can write rules to catch common programming bugs and security loopholes. I hope this blog post inspires you to look further into Semgrep and the capabilities it has to offer.
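Combining both scenarios could look roughly like this. The rule id, the $CMD metavariable name, and the Django-style request access are assumptions for illustration.

```yaml
rules:
  - id: os-system-user-input   # hypothetical rule id
    languages: [python]
    severity: ERROR             # assumed severity
    message: User input from request.$Y reaches os.system
    pattern-either:
      # Scenario 1: the request lookup is passed directly.
      - pattern: os.system(request.$Y.get(...))
      # Scenario 2: the input is stored in a variable first, and the
      # same variable ($CMD) is later passed to os.system.
      - pattern: |
          $CMD = request.$Y.get(...)
          ...
          os.system($CMD)
```

Reusing $CMD in both statements is what ties the assignment to the later call: Semgrep only reports a match when both occurrences bind to the same variable.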