What is a Vector Database?
A vector database is a data store optimized for high-dimensional embeddings: it indexes vectors so that nearest-neighbor queries (find the k vectors most similar to this one) typically return in milliseconds rather than the seconds a naive scan over a large corpus can take. Vector databases are the substrate for retrieval-augmented generation (RAG) pipelines, semantic search, recommendation engines, and any system where similarity in embedding space is the access pattern.
How vector databases work
The core operation is approximate nearest neighbor (ANN) search. Exact nearest-neighbor search over millions of high-dimensional vectors is expensive because every query must be compared against every stored vector. ANN algorithms trade a small loss in recall for a speedup that can reach orders of magnitude.
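To make the cost of the exact approach concrete, here is a minimal sketch of a brute-force search in plain Python. The corpus, document IDs, and function names are hypothetical and chosen only for illustration; the point is that every query touches every stored vector.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Score every stored vector against the query: O(n * d) work per query."""
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [vec_id for _, vec_id in heapq.nlargest(k, scored)]


# Hypothetical toy corpus; real embeddings have hundreds or thousands of dimensions.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
nearest = exact_top_k([1.0, 0.0, 0.0], corpus, k=2)
print(nearest)  # ['doc_a', 'doc_b']
```

With n stored vectors of dimension d, each query costs on the order of n times d multiplications. That linear scan is the work ANN indexes are designed to avoid.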
Common ANN algorithms:
- HNSW (Hierarchical Navigable Small World) — graph-based, excellent recall, used by most modern stores
- IVF (Inverted File Index) — partition vectors into clusters, search only relevant clusters
- Product Quantization — compress vectors so more fit in RAM, trade precision for footprint
- DiskANN, ScaNN, FAISS — research-backed implementations (FAISS and ScaNN are libraries, DiskANN is a disk-based graph index) that underpin many products
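As an illustration of the IVF idea from the list above, the following sketch partitions a toy corpus around two fixed centroids and searches only the closest clusters. The centroids, vector IDs, and the `nprobe` parameter name are assumptions made for this example; a real IVF index would learn its centroids (commonly with k-means) and use far more clusters.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector ID to the inverted list of its closest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        closest = min(range(len(centroids)),
                      key=lambda i: squared_distance(vec, centroids[i]))
        lists[closest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: squared_distance(query, centroids[i]))
    candidates = [vec_id for i in ranked[:nprobe] for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: squared_distance(query, vectors[vec_id]))
    return candidates[:k]


# Hypothetical 2-D data with hand-picked centroids (assumed, not learned).
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [0.5, 0.5],
    "b": [1.0, 0.0],
    "c": [9.0, 10.0],
    "d": [10.0, 9.5],
    "e": [4.0, 4.0],
}
lists = build_ivf(vectors, centroids)

query = [6.0, 6.0]
approx = ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)
wider = ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)
print(approx, wider)  # ['c'] ['e']
```

The query sits near the cluster boundary, so probing one cluster misses the true nearest neighbor "e", which was assigned to the other cluster. Probing both clusters finds it at the cost of scanning every vector. This is the accuracy-for-speed trade that ANN indexes expose as a tuning knob.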
Vector databases also store metadata alongside each vector (document text, source URL, timestamps, ACL tags) and support filtered queries, for example "find documents similar to this one, but only those tagged engineering and modified in the last 30 days."
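A filtered query of this kind can be sketched in a few lines. The `Record` type, field names, and sample documents below are hypothetical. The sketch filters first and ranks second, which is only one of several strategies; production stores integrate filtering with the index in different ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    doc_id: str
    vector: list
    tags: set
    modified: datetime


def dot(a, b):
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query, records, required_tag, max_age_days, now, k):
    """Keep records that match the metadata filter, then rank by similarity."""
    cutoff = now - timedelta(days=max_age_days)
    eligible = [r for r in records
                if required_tag in r.tags and r.modified >= cutoff]
    eligible.sort(key=lambda r: dot(query, r.vector), reverse=True)
    return [r.doc_id for r in eligible[:k]]


# Hypothetical records; "now" is fixed so the example is reproducible.
now = datetime(2024, 6, 30)
records = [
    Record("design-doc", [0.9, 0.1], {"engineering"}, datetime(2024, 6, 20)),
    Record("old-runbook", [1.0, 0.0], {"engineering"}, datetime(2024, 1, 5)),
    Record("sales-deck", [0.95, 0.05], {"sales"}, datetime(2024, 6, 25)),
    Record("postmortem", [0.6, 0.4], {"engineering"}, datetime(2024, 6, 28)),
]
hits = filtered_search([1.0, 0.0], records, "engineering", 30, now, k=2)
print(hits)  # ['design-doc', 'postmortem']
```

The two highest-scoring vectors ("old-runbook" and "sales-deck") are excluded by the metadata filter, so the result differs from a pure similarity ranking.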
Common vector databases
- Cloud-native — Pinecone, Weaviate Cloud, Qdrant Cloud, Vespa Cloud
- Self-hosted open source — Chroma, Qdrant, Weaviate, Milvus, Vespa
- Postgres extensions — pgvector (most popular), Lantern
- Existing data store extensions — Redis, Elasticsearch, and MongoDB Atlas have all added vector support
- Embedded / local — LanceDB, Chroma, sqlite-vec (for prototypes and edge deployments)
Security implications
Vector databases inherit the security concerns of any data store, plus a few specific to vector data:
- The corpus itself is sensitive. If the database holds embeddings of internal documents, embedding inversion attacks can recover enough of the original text that the embeddings should get the same access controls as the source content.
- Metadata-filter bypass — when ACLs are enforced via metadata filters at query time, errors in the filter logic can leak vectors the user shouldn't see.
- Embedding-space poisoning — attackers who can write to the database (e.g. via user-generated content that gets indexed) can craft adversarial documents that get retrieved on target queries, weaponizing the retrieval layer for prompt injection.
- Side-channel via query latency — in some configurations, the time taken to return results leaks information about what's in the corpus.
Securing a vector database requires per-query ACL enforcement, write-side validation of indexed content, and audit logging of retrievals against sensitive partitions.
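Two of these controls, per-query ACL enforcement and audit logging, can be sketched as follows; write-side validation is not covered here. All names (`StoredChunk`, `secure_search`, the group and partition labels) are hypothetical. The key assumption is that `user_groups` comes from the server's authenticated session and never from a caller-supplied filter.

```python
from dataclasses import dataclass


@dataclass
class StoredChunk:
    chunk_id: str
    vector: list
    allowed_groups: frozenset
    partition: str = "general"


def dot(a, b):
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def secure_search(user_id, user_groups, query, chunks, k,
                  audit_log, sensitive_partitions):
    """Drop chunks the user cannot read before ranking, then log sensitive hits."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    visible.sort(key=lambda c: dot(query, c.vector), reverse=True)
    results = visible[:k]
    for c in results:
        if c.partition in sensitive_partitions:
            audit_log.append(
                {"user": user_id, "chunk": c.chunk_id, "partition": c.partition}
            )
    return [c.chunk_id for c in results]


# Hypothetical corpus with one sensitive partition.
chunks = [
    StoredChunk("handbook-1", [0.8, 0.2], frozenset({"all-staff"})),
    StoredChunk("payroll-7", [1.0, 0.0], frozenset({"hr"}), "hr-confidential"),
    StoredChunk("payroll-9", [0.7, 0.3], frozenset({"hr"}), "hr-confidential"),
]
audit_log = []
sensitive = {"hr-confidential"}

alice = secure_search("alice", frozenset({"all-staff"}), [1.0, 0.0],
                      chunks, 2, audit_log, sensitive)
bob = secure_search("bob", frozenset({"all-staff", "hr"}), [1.0, 0.0],
                    chunks, 2, audit_log, sensitive)
print(alice)      # ['handbook-1']
print(bob)        # ['payroll-7', 'handbook-1']
print(audit_log)  # one entry: bob retrieved payroll-7
```

Applying the ACL before ranking means an unauthorized chunk never enters the candidate set, so it cannot leak through top-k truncation or scoring. The audit log records only retrievals from partitions marked sensitive, which keeps the log small enough to review.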