Yeah, this is the great part about working with search, you really do get to sort of go to the gym when it comes to software engineering breadth. From hardware, to algorithms, to networking, to architecture, to UX, there is really an interesting problem everywhere you turn to look. Even just writing a file to disk is a challenge when the file is several dozen gigabytes and needs to be written byte-by-byte in a largely random order.
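To make the file-writing point concrete, here is a minimal sketch of that random-order write pattern using Python's `mmap`. The file name, size, and offsets are invented for illustration (a tiny 1 MiB stand-in for a multi-gigabyte index), and a real engine would batch and order these writes far more carefully:

```python
import mmap
import os
import random

# A 1 MiB stand-in for a multi-gigabyte index file. At real sizes,
# seek-heavy byte-by-byte writes without mmap or batching get painful.
SIZE = 1 << 20

with open("index.bin", "wb") as f:
    f.truncate(SIZE)  # sparse preallocation

with open("index.bin", "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as mm:
        offsets = random.sample(range(SIZE), 10_000)
        for off in offsets:
            mm[off] = off & 0xFF  # byte-by-byte, in random order
        mm.flush()  # let the OS write dirty pages back in bulk
        written_ok = all(mm[off] == (off & 0xFF) for off in offsets)

os.remove("index.bin")
```

The mmap lets the kernel absorb the random access pattern and flush dirty pages in larger sequential runs, which is the usual trick when you can't control write order.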
This does go some way toward explaining why Google interviews looked the way they've looked. It's just a shame everywhere else has copied their homework, without actually needing the same skills.
Yes, yes, yes :D There are so many topics in this space that are so interesting it's like a dream. I would add to your list:
- sentiment analysis
- roaring bitmaps
- compression
- applied linear algebra
- ai
In a Venn diagram intersecting all of these topics sits search. Coding a search engine from scratch is a beautiful way to spend one's days, if you're into programming.
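On the roaring bitmaps item: the payoff is that set operations over document IDs become bitwise ops. Real roaring bitmaps split the 32-bit ID space into 16-bit chunks and pick array/bitmap/run containers per chunk; the sketch below cheats and uses a plain Python int as an uncompressed bitset (doc IDs are invented for the example):

```python
# Bitset posting lists: AND/OR queries become single bitwise operations.
# (Simplified; real roaring bitmaps add per-chunk container compression.)

def to_bitset(doc_ids):
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits

def from_bitset(bits):
    out, i = [], 0
    while bits:
        if bits & 1:
            out.append(i)
        bits >>= 1
        i += 1
    return out

# Posting lists for two hypothetical terms.
search = to_bitset([1, 4, 7, 9, 12])
engine = to_bitset([0, 4, 9, 12, 30])

both = search & engine    # AND query: docs containing both terms
either = search | engine  # OR query

print(from_bitset(both))  # [4, 9, 12]
```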
True story: I once had a discussion with a developer about search in general (for your website, for the internet) and the difficulties involved. Precision vs. recall, relevancy vs. popularity, ranking, etc.
He was dumbfounded that I would want to spend two weeks 'tuning Solr queries' for a project. He asked (nay, stated)
- Tries (patricia, radix, etc...)
- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)
- Consensus (raft, paxos, etc..)
- Block storage (disk block size optimizations, mmap files, delta storage, etc..)
- Probabilistic filters (hyperloglog, bloom filters, etc...)
- Binary Search (sstables, sorted inverted indexes)
- Ranking (pagerank, tf/idf, bm25, etc...)
- NLP (stemming, POS tagging, subject identification, etc...)
- HTML (document parsing/lexing)
- Images (exif extraction, removal, resizing / proxying, etc...)
- Queues (SQS, NATS, Apollo, etc...)
- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)
- Rate limiting (leaky bucket, windowed, etc...)
- text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)
- etc...
I'm sure there is plenty more I've missed. There are lots of generic structures involved like hashes, linked-lists, skip-lists, heaps and priority queues and this is just to get 2000's level basic tech.
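Picking one item from the list, ranking: a tiny BM25 scorer over an in-memory inverted index. The corpus is made up, k1=1.5 and b=0.75 are the usual defaults, and this is a sketch of the formula, not a production implementation:

```python
import math
from collections import Counter

# Toy corpus; a real engine scores against an on-disk inverted index.
docs = [
    "the quick brown fox",
    "the lazy brown dog",
    "search engines rank documents",
    "ranking with bm25 beats plain tf idf for search",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(t) for t in tokenized) / N
tf = [Counter(t) for t in tokenized]                      # term freq per doc
df = Counter(term for t in tokenized for term in set(t))  # doc freq per term

def bm25(query, k1=1.5, b=0.75):
    scores = []
    for i, terms in enumerate(tokenized):
        s = 0.0
        for q in query.split():
            if df[q] == 0:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            f = tf[i][q]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(terms) / avgdl))
        scores.append((s, i))
    return sorted(scores, reverse=True)

best_score, best_doc = bm25("bm25 search")[0]
```

The length normalization term (the `b * len(terms) / avgdl` part) is what keeps long documents from winning on raw term frequency alone, which is the main practical improvement over plain tf/idf.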
- https://github.com/quickwit-oss/tantivy
- https://github.com/valeriansaliou/sonic
- https://github.com/mosuka/phalanx
- https://github.com/meilisearch/MeiliSearch
- https://github.com/blevesearch/bleve
- https://github.com/thomasjungblut/go-sstables
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. That might work for something small like a curated collection of a few hundred sites.
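The structure those engines build under the hood is an inverted index, and a toy version (hypothetical corpus, no compression, no sharding) hints at why terabytes of it get hard:

```python
from collections import defaultdict

# Minimal inverted index: term -> list of doc IDs (a posting list).
# At terabyte scale these lists must be delta/varint compressed, sharded,
# and merged from immutable on-disk segments; none of that is here.
index = defaultdict(list)
docs = {
    0: "postgres full text search",
    1: "elasticsearch scales search with shards",
    2: "a curated collection of a few hundred sites",
}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index[term].append(doc_id)

def query_and(*terms):
    # Intersect posting lists; real engines do this over compressed blocks.
    postings = [set(index[t]) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(query_and("search"))  # [0, 1]
```

Every posting list here lives comfortably in RAM; the whole discipline above (compression, block storage, probabilistic filters, consensus for the shards) exists because at scale they don't.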