How does semantic search work?

A model converts both your query and each passage of the document into a vector embedding - a list of numbers representing meaning - then finds the passages whose vectors are closest to the query's vector.

Does semantic search still use keywords at all?

It doesn't need to, but production systems often blend semantic and keyword matching (hybrid search) so rare terms - product codes, names, exact phrases - aren't missed just because they're semantically ambiguous.

Semantic search - Definition

Semantic search finds text by meaning rather than by matching exact words. Instead of looking for the literal keywords you typed, it compares the meaning of your question with the meaning of each passage and retrieves those that express the same idea - even when they share no words with your query.

Why it matters

Traditional keyword search is brittle. Ask a document about its "cancellation policy" and a keyword index will miss a paragraph headed "ending your subscription," because the words don't match. Semantic search closes that gap: it matches concepts, so a search for "how do I leave" surfaces the right passage regardless of the exact phrasing the author used. For long documents, wikis, and research papers - where the same idea is written a dozen different ways - this is the difference between finding the answer and scrolling forever.

How it works

Semantic search relies on vector embeddings: a model converts each chunk of text into a list of numbers that captures its meaning, and converts your query the same way. Passages whose vectors sit closest to the query vector are treated as the most relevant and are returned first.
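To make "closest" concrete: one common way to compare two embeddings is cosine similarity, the cosine of the angle between the vectors. The sketch below uses made-up three-dimensional vectors purely for illustration. Real embedding models produce hundreds or thousands of dimensions, and the numbers and passage labels here are assumptions, not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked for illustration only.
query = [0.9, 0.1, 0.0]              # "how do I leave"
passage_on_topic = [0.8, 0.2, 0.1]   # "ending your subscription"
passage_off_topic = [0.0, 0.2, 0.9]  # "supported file formats"

on_topic_score = cosine_similarity(query, passage_on_topic)
off_topic_score = cosine_similarity(query, passage_off_topic)
```

With these toy vectors the on-topic passage scores close to 1 and the off-topic passage scores close to 0, which is the ordering a search would use to decide what to return first.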

The typical pipeline is:

Index. Split the document into passages and embed each one.
Query. Embed the question and find the nearest passages by vector similarity.
Rank. Return the best-matching passages, often re-scored for precision.
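The three steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function is a hypothetical stand-in that counts words from hand-written concept groups, whereas a real system would call a trained embedding model. The passages, the concept groups, and the names `build_index` and `search` are all assumptions made for the example, and the optional re-scoring pass is left out.

```python
import math
import re

# Stand-in for a real embedding model (an assumption for this sketch):
# each dimension counts the words that belong to one concept group.
CONCEPT_GROUPS = [
    {"cancel", "cancellation", "ending", "stop", "leave"},  # leaving a service
    {"billing", "invoice", "payment", "charged"},           # money
    {"file", "formats", "pdf", "text"},                     # file handling
]

def embed(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [sum(w in group for w in words) for group in CONCEPT_GROUPS]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(passages):
    """Index step: embed every passage once, up front."""
    return [(passage, embed(passage)) for passage in passages]

def search(index, query, top_k=2):
    """Query and rank steps: embed the query, score each passage, sort."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), passage)
              for passage, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return scored[:top_k]

passages = [
    "Billing happens on the first day of each month.",
    "Ending your subscription: you can stop your plan at any time.",
    "Supported file formats include PDF and plain text.",
]
index = build_index(passages)
results = search(index, "how do I cancel")
```

Here the query and the top-ranked passage share no words; they match only because the stand-in embedder places "cancel," "ending," and "stop" in the same concept group, which is a crude imitation of what a trained model learns from data.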

A worked example

Say a 40-page employee handbook has a section titled "Ending your employment" that covers notice periods, final pay, and returning equipment. An employee searches the handbook for "how do I quit."

A keyword search looks for the literal words in the query. The only distinctive one, "quit," appears nowhere in that section's heading or text, and the rest ("how," "do," "I") are too common to identify anything. The search either returns nothing useful or surfaces unrelated pages that happen to contain "how" or "I" often enough to rank.

A semantic search embeds "how do I quit" as a vector representing the concept of voluntarily leaving a job. It compares that vector to the embeddings of every passage in the handbook, and the "Ending your employment" section - despite never using the word "quit" - sits closest in meaning, because resignation, notice periods, and quitting all belong to the same underlying idea. That section comes back first, with the paragraph about notice periods ranked above general boilerplate.

The same effect shows up constantly in real documents: a research paper that never says "downside" but has a whole section called "Limitations," a contract that never says "cancel" but has a clause titled "Termination for convenience," a wiki page about "leave policy" that answers a search for "can I take a sick day." In each case the reader's words and the author's words differ, but the underlying meaning lines up - and that's exactly the gap semantic search is built to close.

Semantic vs. keyword search - when each wins

	Semantic search	Keyword search
Matches on	Meaning and concept	Literal words and phrases
Finds paraphrases	Yes - "cancel my plan" matches "end your subscription"	No - needs the same words to appear
Exact strings (codes, names, IDs)	Can miss or under-rank them	Excels - exact match is the point
Needs the same vocabulary as the document	No	Yes
Typical weakness	Rare or highly specific terms can get diluted by "similar" but wrong results	Misses synonyms, rephrasing, and questions worded differently from the source
Best combined as	Hybrid search - semantic + keyword together, for coverage and precision

Neither approach is strictly better in isolation. A support wiki full of product SKUs and error codes needs keyword precision as much as it needs semantic recall; a research paper full of paraphrased ideas needs semantic recall more than exact-string matching. That's why production systems increasingly run both and merge the results, rather than betting everything on one technique.

When each wins, concretely:

Semantic search wins when the reader's words and the author's words differ - questions ("how do I cancel?"), paraphrases, synonyms, and concept lookups in long documents.
Keyword search wins when the query is an exact string - product codes, error messages, clause numbers, names - where "approximately right" is exactly wrong.
Hybrid search wins overall. Run both and merge the results, and neither failure mode is fatal: exact strings still match exactly, and paraphrases still surface.
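One way to "run both and merge the results" is reciprocal rank fusion (RRF), which combines ranked lists using only each item's position, so the semantic and keyword scores never need to share a scale. The text above does not say which merging method any particular system uses; RRF is shown here as one widely used option. The passage ids and the two input rankings are hypothetical, and `k=60` is a commonly used default rather than a required value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings of passage ids into one list.

    Each ranking adds 1 / (k + rank) to a passage's score, with rank
    starting at 1, so passages ranked well by several lists rise.
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)

# Hypothetical results for the query "error E-4012 when cancelling".
semantic_ranking = ["cancel-howto", "refund-policy", "error-e4012"]
keyword_ranking = ["error-e4012", "cancel-howto"]

merged = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
```

In this example the passage that both searches found ("cancel-howto") and the exact-match passage that only keyword search ranked highly ("error-e4012") both finish ahead of the passage that a single search returned, which is the coverage-plus-precision behavior described above.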

Where it fits in Sidenote

Semantic search is the retrieval step that makes grounded answers possible. When you ask Sidenote a question, it semantically searches the document you are reading to pull the passages that actually address it, then feeds only those to the model - the same retrieve-then-answer pattern as retrieval-augmented generation. Because the answer is built from real passages, every claim can carry a citation that scrolls straight to the source sentence. Good search is what lets Sidenote say "here is the exact line," rather than guessing.
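As a rough illustration of the retrieve-then-answer pattern in general (not a description of Sidenote's actual implementation), the sketch below assembles a prompt that contains only the retrieved passages, each tagged with an id the model can cite. The function name, the instruction wording, the passage ids, and the handbook sentences are all hypothetical.

```python
def build_grounded_prompt(question, retrieved):
    """Assemble a prompt that contains only the retrieved passages.

    `retrieved` is a list of (passage_id, text) pairs, best match first.
    """
    lines = [
        "Answer the question using only the passages below.",
        "Cite the passage id in square brackets after each claim.",
        "",
    ]
    for passage_id, text in retrieved:
        lines.append(f"[{passage_id}] {text}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Hypothetical search results for a handbook question.
retrieved = [
    ("p12", "Employees must give four weeks' written notice."),
    ("p13", "Final pay is issued on the next regular payday."),
]
prompt = build_grounded_prompt("How much notice do I need to give?", retrieved)
```

Because every passage carries an id, a citation in the answer can be mapped back to the exact location in the source document.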

FAQ

What is semantic search?

Semantic search is a retrieval technique that matches a query to passages by meaning rather than by shared words. Both the query and the document's passages are converted into vector embeddings - numeric representations of meaning - and the passages closest to the query in that vector space are returned, even when they use completely different vocabulary.

What are vector embeddings?

A vector embedding is a list of numbers a model produces to represent a piece of text, positioned so that texts with similar meanings end up numerically close together. Embeddings are what make semantic search computable: "how do I quit" and "ending your employment" land near each other in the vector space because they mean nearly the same thing.

Is semantic search better than keyword search?

Better at different jobs. Semantic search finds paraphrases and concepts that keyword search misses entirely; keyword search nails exact strings - names, codes, quoted phrases - that semantic matching can blur. Most production systems run both and merge the results as hybrid search, which covers the failure modes of each technique on its own.

Does semantic search work on scanned documents?

Only after the text has been extracted. A scanned PDF is a set of page images with no text layer, so it first needs OCR to recover machine-readable text; once the words exist as text, they can be embedded and searched semantically like any other document.
