Re-ally fast

- (6 min read)

Lately I have been trying to get into the Rust ecosystem to get a feel of the tools and the language. In the process I came across a few tools that I would highly recommend to try out.

  • ripgrep - A blazing fast alternative to GNU grep
  • alacritty - A GPU accelerated terminal emulator
  • exa - A replacement for ls written in Rust

The tools exa is for a more pretty ls, may not appeal to everyone. Similar is the case for alacritty, it is intended to be used by people who are comfortable with tmux as it doesn't have tabs, kind of a bummer but if you roll with tmux, it should not make any difference anyways.

The tool that I am excited about is, ripgrep which is a crazy fast alternative for GNU grep. It has a lot more features than grep and the speed it chugs through text using regular expressions is amazing.

A more detailed and in-depth explanation and benchmarks can be found at ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} which is blog entry from the creators.

Some numbers

Let's first see ripgrep in action (so that there is some substance to my claims).

Destruction

The above test is to do a case insensitive search on a text file which is close to 1.2GB of size. The grep took more than 24s while ripgrep casually completed the search within half of a second which is just pure destruction in terms of speed. The shasum part is to make sure the output of both the tools are correct. I would definitely trust grep for it's correctness and the above screenshot shows both produce the same output hashes, which means ripgrep's output is also correct.

Unless I stumbled upon a pair of texts that tend to produce the same SHA-256 hash. SHAttered 256 times bada**!.

What's up with ripgrep ?

Restricted but fast regex library

A portion of the speed gain comes from not complying to the PCRE standards which supports all features including look-ahead and look-behind features.

eg: If you try the following regex in Rust, it fails

extern crate regex; // 1.3.9

use regex::Regex;

fn main() {
    let haystack = "world's";
    let re = Regex::new(r"\w+(?<!s)\b").unwrap();
    println!("{}", re.is_match(haystack));
}

with the following error

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    \w+(?<!s)\b
       ^^^^
error: look-around, including look-ahead and look-behind, is not supported
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the other hand, python3 implements look-ahead and look-back syntax.

Python 3.7.4 (default, Oct 12 2019, 18:55:28)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> t = re.compile(r"\s\w+(?<!s)\b")
>>> t.findall("hello world's")
[' world']

Better optimizations to the literal string comparision

Regex being a general purpose search mechanism can be slow and can be beaten by classical string search algorithms like Boyer-Moore.

The regular brute-force string searches or the python in keyword work by traversing the haystack until it finds the first character of the needle, then it extracts the substring of the length of the needle and then matches. If the string does not match, increment to the next char in the haystack and continue.

>>> needle="rust"
>>> haystack="searching for rust lang"
>>> for i in range(len(haystack) - len(needle) + 1):
...     if haystack[i: i + len(needle)] == needle:
...             print ("Found at", i)
...
Found at 14

This can be seen as aligning the needle in all possible ways against the haystack and then comparing to see which one matches perfectly. Something like

A N P A N M A N -
P A N - - - - - -
- P A N - - - - -
- - P A N - - - -
- - - P A N - - -
- - - - P A N - -
- - - - - P A N -

What Boyer-Moore does is, with some pre-computation magic, it will try to find the best positions that needs to actually be tested. It does so by finding the byte offsets at which the last byte of the needle matches the haystack. If the last byte does not match, there is no point visiting the chars before that byte offset in that alignment. This search of "candidates" is done by memchr which can find the first position of a char in a memory area. The fancy thing about memchr is that, it compiles down to SMID instruction.

This means that it can find the search candidates really fast due to vectorized instructions. The regex crate gwill go to great lengths to extract literal strings from the regular expressions and then find search candidates in order to minimise the regex search time.

Most of the tools search for a needle in a haystack that is made up of lines of text, i.e. delimited lines. The regex crate will find the lines that may match and then run the full regex search on it saving a lot of time from the regex overhead.

For multiple literals, i.e. "you|me", the regex crate will use plain Aho-Corasick or a vectorized algorithm called Teddy when enabled.

To be honest, I don't understand a lot of it. I will try and implement the Boyer-Moore algorithm someday maybe to get a better understanding of what is going on.

Substring search algorithms

  • The z-algorithm is a pretty good choice which requires O(n + m) worst case time and space if the needle and the haystack don't change.

  • If there are a lot of needles to search from a small number of haystacks, probably a Trie would be more suited. This will not scale for a large number of haystacks as all of the tries will be needed to be present in memory at the same time which is not possible. For example, in the above 1.2G exmaple, having that amound of storage occupied all the time may not be necessary.

  • If the needle isn't changing and there are a lot of haystacks, some algorithm that pre-processes the needle might help, just like Aho-Corasick or Boyer-Moore.

Disclaimer

All of this is shamelessly ripped off of the Burntsushi/ripgrep blog and it's highly recommended to read that if you need a clearer picture.