Hash filtering - an appeal to more than vanity

DRAFT

Now published here

tl;dr: The major password-cracking projects should add support for general hash substring searching. This has value beyond vanity hashes, so I'd like to give it a new name: hash filtering.

What is a vanity hash?

A vanity hash is a hash that contains an interesting substring. For example, my actual Bitcoin address is:

1AzVVHTijMGoWzPtUJoWkF5u6vyS1NcTzj

But it would be cooler if it was this:

RoyceHTijMGoWzPtUJoWkF5u6vyS1NcTzj

That's a vanity hash. Finding one requires a brute-force search. The more characters you want, the harder it gets -- exponentially harder. But even just a few characters can be good enough.

Current examples of vanity hashes

We humans are vain creatures -- so when hashes are used as public identifiers, we want to personalize them:

Bitcoin addresses can be "vanity addresses".
Onion hashes and GPG keys can be vanity hashes.
Tripcodes (used to prove identity on image boards) have vanity hash tools that can search for descrypt-like vanity hashes (at speeds in the 950Mh/s range!)
Git commit hashes can be vanity hashes if patched to support it.
Some folks want to personalize certificate signatures.

Why hash substring searching is interesting

Finding substrings in hashes isn't just about vanity. It enables other interesting work:

Working with non-contiguous substrings. Most vanity-hashing tools won't let you search for a hash that begins with "[aA][bB][lL][eE]" and ends with "[eE][lL][bB][aA]". But password crackers' mask syntax is perfect for this.
Experimenting with collisions and partial collisions. This is a potential attack surface that is not adequately publicly explored. Nation-states are almost certainly capable of locating near-collisions for some hashes. When you check a download hash, how often is that check only a cursory visual check? True collisions for slow hashes are nigh-infeasible, of course -- but near-collisions may be interesting.
Demonstrating weaknesses. For example, short GPG key IDs can have collisions.
Filtering for hashes with broad properties (instead of just substrings). Hashes that contain only digits, or only letters, or letters and digits alternating, or only lower case (for hashes that distinguish case). Perhaps only a curiosity today, but they might be interesting in the future.
Automatic/loose discovery of similar strings. This would be a nice-to-have at some point, but searching by edit distance would complement substring searching nicely. A minimum or maximum edit distance could be specified to find strings that are "close enough". One of the most common measurements of edit distance is Levenshtein distance, which has been implemented in most languages, and implemented efficiently on GPU.
Unforeseeable innovation. Once this capability is available across all hash types, people will tinker with it, revealing new use cases. I've seen some CTFs that involve searching for specific substrings in hashes. I think they're on to something.

So as you can see, it's not just about vanity.

That's why I want to give the overall class of hash substring searching -- collisions, vanity hashes, etc. -- a different name.

Let's call it hash filtering.

Why the password crackers are the best place to do hash filtering

It's efficient. No one is better positioned to implement hash filtering than the password crackers -- and doing it anywhere else is a waste of resources. Each standalone implementation invents its own hash bruteforcing. This rarely approaches the speeds that JtR and hashcat are capable of (though there are exceptions). The Bitcoin folks have an edge today because of FPGAs, but I expect this gap to close.
It's high leverage. If implemented as a general framework within password crackers, all current and future hash types automatically inherit substring search capability, with little additional effort.

How the password crackers might respond

OK -- but it should be done as a separate executable. This is a waste of time. The best features of professional-grade password crackers -- core brute-force power, Markov, forking/parallelism, session management, masks, device selection, wordlist management, etc. -- would either be unnecessarily duplicated, left out, or suffer from bit rot.
No. We're not interested in vanity hashing as a concept. That would be a mistake. Vanity hashing is already driving interest, contribution -- and even innovation (like entirely new S-boxes).
No. It might attract noobs. Unavoidable -- but this should be offset by an influx of insight and talent. Also, we have to feed a certain number of noobs to Xanadrel, the Sarlacc of #hashcat. ;)
No. It adds complexity. Technically true, but it should be a largely one-time cost. Also, I think that implementing Levenshtein distance on GPU would be kinda fun if I was atom or Sayantan. ;)
No. It will slow down cracking too much. Hash filtering should be implemented as an early pre-optimization step, so that you only have to take the inevitable speed hit if you want hash filtering. Regular cracking would be unaffected.
No. Hash filtering itself will be too slow. By its nature, it will have to be much slower than password cracking. But if it can be pre-optimized (see previous point), then it can take advantage of the rest of the cracking support structure, so that hash filtering will be faster, more portable, and easier to use than it will be anywhere else.

Implementation suggestions

Allow the user to specify a "hash filter", using existing mask syntax. This would automatically enable specifying one or more desired substrings anywhere in the hash -- beginning, end, or middle -- as well as both non-contiguous strings and loose matching by character set.
On the command line, consider long options with names like'--hash-filter [mask]' and '--edit-distance [integer]'. Or name it something else -- but keep it consistent across projects.
Implement in the pre-optimization pass, so that the slowdown introduced by examining each hash will only kick in when hash filtering is explicitly requested.
Add simple sanity checking. If the user supplies a mask that is longer than the target hash, warn rather than silently truncating. If the user wants to filter for a mask that isn't possible (for example, with a descrypt hash, a filter with the two characters outside of [.26AEIMQUYcgkosw]).

Potential bounty?

If this proposal is not obviously compelling, I will consider setting up a bounty (or charitable donation of the winner's choosing). The bounty would go to each major natively Linux-based project (John the Ripper or hashcat) that incorporates hash filtering. Edit distance would be optional.

If you want to go in on a bounty, contact me and I'll do the Bountysource setup.