
What is the app?


It's a protocol for crowd sourced annotation data. So an example app that uses it would go something like this:

Suppose you have a food allergy, and you have a reaction to something in a restaurant. You want to leave a note on that menu item "contains allergen XYZ" but you don't want to write on that menu, you want to annotate all such menus.

You'd take a picture, OCR happens, some algorithm thinks about line wrapping and renders it as a list of strings, and then a rolling hash identifies the "features" among those strings (every substring hashes to a 16-bit integer; the ones whose first 8 bits are all zero count as features). Then you "paint" the menu entry in the color "contains allergen XYZ". The features near your "brushstroke" (i.e. text highlighting) are stored in a table for that "color", which is eventually synced with other users.
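
To make that concrete, here's a rough Python sketch of the feature-finding step. It's illustrative only: the window width, multiplier, and exact feature condition are made-up parameters, not what gnize actually uses.

```python
# Sketch of the feature-finding idea: slide a fixed-width window over a
# string, maintain a 16-bit rolling hash, and call a window a "feature"
# whenever the top 8 bits of its hash are zero (~1 in 256 windows).

WINDOW = 8      # window width in characters (illustrative choice)
BASE = 257      # multiplier for the polynomial rolling hash
MASK = 0xFFFF   # keep hashes to 16 bits

def find_features(text, window=WINDOW):
    if len(text) < window:
        return []
    # BASE**(window-1) mod 2**16, used to remove the outgoing character
    high = pow(BASE, window - 1, MASK + 1)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) & MASK
    features = []
    for i in range(len(text) - window + 1):
        if h & 0xFF00 == 0:  # top 8 bits off -> this window is a feature
            features.append((i, text[i:i + window], h))
        if i + window < len(text):
            # roll: drop text[i], bring in text[i+window]
            h = ((h - ord(text[i]) * high) * BASE
                 + ord(text[i + window])) & MASK
    return features

menu_line = "Pad Thai - rice noodles, egg, peanuts, tamarind sauce"
for pos, substr, h in find_features(menu_line):
    print(pos, repr(substr), hex(h))
```

Because features are content-defined, the same substrings surface no matter where the line starts in someone else's scan, which is what lets two independent photographs of the same menu agree on anchor points.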

Later, someone else who subscribes to that "color" and has a trust relationship with the first user can scan the menu, follow the same process to find the features, which are used as indices to look up the brushstroke. Then they're able to see the annotation left by the other user as an overlay on the image they queried with: supposing they have the same allergen, they now know to avoid that item.
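
The lookup side could be sketched like this. The data model and feature values here are hypothetical; a real implementation would also carry the brushstroke's geometry and the trust/sync machinery.

```python
# Hypothetical sketch of a "color": a table keyed by feature hash.
# Painting stores a note under nearby features; a later scan recomputes
# features and uses them as indices into that table.

def paint(color, features, note):
    # store the brushstroke under every feature near the highlight
    for _, h in features:
        color[h] = note

def query(color, features):
    # any feature that matches a stored stroke reveals the annotation
    return [(pos, color[h]) for pos, h in features if h in color]

# made-up (position, feature-hash) pairs from two independent scans;
# positions shift between photos but the content-defined hashes agree
scan_a = [(12, 0x003B), (30, 0x00C1), (47, 0x002A)]
scan_b = [(10, 0x003B), (29, 0x00C1), (51, 0x0099)]

color = {}  # the "contains allergen XYZ" color's feature table
paint(color, scan_a, "contains allergen XYZ")
print(query(color, scan_b))  # overlapping features surface the note
```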

I'm calling the whole scheme Semantic Paint and the index-friendly-feature-finder Gnize (like cognize now, recognize later).

It's meant for local-ish use by small-ish communities, so the data you actually have to store on your device is pretty small and restricted to colors that you've chosen and other users you've explicitly (or transitively) trusted. And you're communicating over spans like weeks or months, so it's not a big deal if it takes a few days for one brushstroke to make it to another user's device. It's not like they notice its untimely arrival; it just goes into a feature database for later query.

I also think it might have applications in genomics/proteomics, e.g. annotating a gene.


Wow, fascinating. Thank you for detailing that. Do you have somewhere to follow its development?


I'm tickled that you're interested. "I have this cool idea involving algorithms on strings" isn't exactly everybody's favorite conversation.

Here's the old version, hardly even proves the concept: https://github.com/MatrixManAtYrService/gnize

Python was giving me trouble, though, so I'm switching to Nim for performance reasons, and also because I can compile it to C, Objective-C, and JavaScript for use by a wider variety of clients. It's just an empty shell right now, but that project will end up here, hopefully soon: https://github.com/gnize


> "I have this cool idea involving algorithms on strings" isn't exactly everybody's favorite conversation.

Haha yeah I get you. I wouldn't have expected it would be for me either but the approach and use case you shared is really cool. Thanks for the links. It will be neat to see the development in Nim, as well.


You both might be interested in this little Nim program "framed" to frame & digest text for near-duplicate detection:

    https://github.com/c-blake/ndup/blob/main/framed.nim
which does a rolling hash with the late Bob Uzgalis' Buz Hash, which is (IMO) about 10x simpler than Rabin fingerprinting. It's really just xor out old, xor in new.
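
For anyone curious, that "xor out old, xor in new" update fits in a few lines of Python. This is a toy version for illustration, not framed.nim's actual code; the 32-bit width and seeded table are my choices.

```python
import random

# Toy Buz hash (cyclic polynomial) rolling hash: each byte maps through
# a random 32-bit table; rolling the window is one rotate plus two xors.

def rotl32(x, k):
    k %= 32
    return ((x << k) | (x >> (32 - k))) & 0xFFFFFFFF

def buzhash(data, window, seed=0):
    """Yield (start_index, hash) for every `window`-byte window of data.
    Different seeds give statistically independent framings."""
    rng = random.Random(seed)
    table = [rng.getrandbits(32) for _ in range(256)]
    h = 0
    for b in data[:window]:
        h = rotl32(h, 1) ^ table[b]
    yield 0, h
    for i in range(window, len(data)):
        # xor out the byte leaving the window, xor in the new byte
        h = (rotl32(h, 1)
             ^ rotl32(table[data[i - window]], window)
             ^ table[data[i]])
        yield i - window + 1, h
```

The appeal over Rabin fingerprinting is that there's no carry propagation: the update is a rotate and two table lookups xor'd in, which is cheap and easy to reason about.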

In my context of near-duplicate detection I worry about false negatives as well as false positives, displaying nearby matches to the user, & such. So, things are set up with random seeds for very independent framings & digests (to avoid unlucky ones). { This is very different from the more common "backup/rsync/transfer" context of "content-based slicing", aka "content-defined chunking". }


Oh wow, this is great. We're up against similar problems here, but what I've got as unnamed musings you've got as links to relevant Wikipedia articles. I've got some reading and thinking to do.

I'll probably just include both algorithms, make it a parameter, and do a bunch of testing (which I'll be sure to publish). I can't put my finger exactly on why I feel better about Rabin fingerprints, maybe buzhash just needs time to sink in.

It's only recently that I've been thinking of this as a framing problem. The right words eluded me because I only care about the frames near the annotation. So instead of "multiple statistically independent framings" I've been thinking of separate "feature channels" (one for each irreducible polynomial over GF(2)), which cause different substrings to be identified as features, but I think it's more or less the same thought deep down.
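
A toy sketch of that "feature channels" idea: here each channel is just a different multiplier in a 16-bit polynomial hash, standing in for a different polynomial, so each channel selects a different ~1/256 subset of substrings. All parameters are invented for illustration.

```python
# Each "channel" runs the same window scan under a different hash
# parameter, so a substring that isn't a feature on one channel may be
# a feature on another -- useful when one channel leaves a gap near
# where the user wants to anchor an annotation.

def features(text, window, mult):
    out = []
    for i in range(len(text) - window + 1):
        h = 0
        for ch in text[i:i + window]:
            h = (h * mult + ord(ch)) & 0xFFFF
        if h & 0xFF00 == 0:  # this channel's feature condition
            out.append(i)
    return out

line = "Pad Thai - rice noodles, egg, peanuts, tamarind sauce"
channels = {m: features(line, 8, m) for m in (257, 263, 269)}
# if one channel has no feature near the annotation, change the channel
```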

I want a more or less uniform distribution of features so that wherever a user wants to put an annotation, there's always something nearby to anchor it to. In the event of a large gap I've imagined schemes to switch to different framings until it's filled. Or maybe I show the features to the user and let them click something to "change the channel" until they get a set of features that works for their particular annotation.

Framing this as a near-duplicate-files problem makes me wonder if it could be used to arrange files into a graph based on how they differ. Like when browsers unhelpfully just increment a counter and re-download the file every time you click a link to it: it's hard to keep track of the fact that copy 6 is the one I want, but if they were rendered as a sort of constellation of files, where the edges are longer for more dissimilarity... that would be fun to play with.
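
That constellation could start from something as simple as Jaccard distance over shingle sets for the edge lengths. A toy sketch (the filenames and contents are invented):

```python
# Jaccard distance between k-character shingle sets: 0.0 for identical
# content, 1.0 for content sharing no shingles. Longer edge = more
# dissimilar file in the "constellation".

def shingles(text, k=4):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def distance(a, b):
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

downloads = {
    "report(1).txt": "quarterly totals: 40, 42, 45",
    "report(2).txt": "quarterly totals: 40, 42, 45, 47",
    "report(6).txt": "completely different draft of the notes",
}
names = sorted(downloads)
edges = {(a, b): distance(downloads[a], downloads[b])
         for a in names for b in names if a < b}
for pair, d in edges.items():
    print(pair, round(d, 3))
```

A force-directed layout over those edge weights would then cluster the near-duplicates and push the odd one out, which is exactly the "copy 6 is the one I want" signal.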

Anyway, I'll be looking through this more in the future. Thanks for sharing it.


You are welcome. Thanks are too rarely offered. :-)

You may also be interested in word stemming (such as used by the Snowball stemmer in https://github.com/c-blake/nimsearch) or other NLP techniques. I don't know how internationalized/multi-lingual that stuff is, but conceptually you might want "series of stemmed words" to be the content fragments of interest.
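
Just to show the shape of the idea, here's a crude suffix-stripping stemmer in Python (nowhere near as careful as Snowball's rules, and English-only; the suffix list is invented):

```python
# Toy stemmer: strip the longest matching suffix, keeping at least a
# 3-character stem. Stemmed words, rather than raw substrings, could
# then be the units fed to the rolling hash.

SUFFIXES = ("ization", "ations", "ation", "ings", "ing",
            "ers", "ed", "er", "es", "s")

def stem(word):
    w = word.lower()
    for suf in SUFFIXES:  # longest suffixes tried first
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[:len(w) - len(suf)]
    return w

print([stem(w) for w in "annotations annotating annotation".split()])
```

Real stemmers have many more rules (and per-language variants), which is why the Snowball family exists, but even this rough normalization makes "annotation" and "annotations" hash identically.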

Similarity scores have many applications. Weights on a graph of cancelled downloads ranked by size might be one. :)

Of course, for your specific "truncation" problem, you might also be able to just do an edit distance against the much smaller filenames and compare data prefixes in files, or use a SHA256 of a content-based first slice. (There are edit distance algos in Nim in https://github.com/c-blake/cligen/blob/master/cligen/textUt.... as well as in https://github.com/c-blake/suggest.) Or, you could use a little program like ndup/sh/ndup to create a "mirrored file tree" of such content-based slices; then you could use any true duplicate-file finder (like https://github.com/c-blake/bu/blob/main/dups.nim) on the little signature system to identify duplicates and go from path suffixes in those clusters back to the main filesystem. Of course, a single KV store within one or two files would be more efficient than thousands of tiny files. There are many possibilities.
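
For the edit-distance piece, the classic dynamic-programming Levenshtein routine is only a few lines of Python (same idea as the Nim routines linked above), and on short filenames it's cheap enough to run pairwise:

```python
# Levenshtein edit distance via the standard two-row DP: prev holds the
# costs for the previous character of `a`, cur is built for the current.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("report(1).txt", "report(6).txt"))  # prints 1
```

Browser-style duplicate names like "report(1).txt" vs "report(6).txt" sit at distance 1, so a tiny threshold already clusters the candidates before any content comparison.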




