This is a detailed breakdown of how a search engine like Google operates at scale: crawling, indexing, ranking, real-time freshness, and multi-datacenter search serving.
We're designing a system built around the following core principles:
| Principle | Description |
|---|---|
| Massive Parallelism | Petabytes of data processed across thousands of nodes |
| MapReduce Pipelines | For crawling, indexing, link analysis |
| Shard Everything | Docs, queries, cache, ads, etc. |
| Edge Caching | For hot queries/snippets |
| Near Real-Time Sync | Index updates within seconds/minutes |
| Query Understanding | NLP + semantic models (BERT, MUM) |
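The link-analysis principle above can be made concrete with a toy PageRank. At Google scale this runs as a distributed MapReduce job over the whole link graph; the sketch below is a minimal in-memory power iteration with the same math (the example graph is made up).

```python
# Minimal PageRank via power iteration on an in-memory link graph.
# The distributed version shards the graph and runs each iteration
# as a MapReduce pass; the update rule is identical.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of outbound pages."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for dest in outs:
                    new_rank[dest] += share
            else:  # dangling page: spread its rank evenly
                for dest in pages:
                    new_rank[dest] += damping * rank[page] / n
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Note that total rank mass stays at 1.0 across iterations, which is a cheap sanity check for any implementation.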
We split the system into an online serving path (query understanding, index lookup, ranking, page rendering) and a massive offline subsystem that handles crawling, indexing, link analysis, and ranking-model training.

Queries enter the online path from several sources:
| Source | Description |
|---|---|
| Web Search Bar | Standard text query input |
| Mobile Voice Input | Transcribed to text via speech-to-text |
| Browser/Search App | May pass device context, history, geo info |
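All of these sources can be normalized into a single request envelope before query understanding runs. The field names below are illustrative assumptions, not Google's actual internal schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchRequest:
    # Hypothetical envelope unifying web, voice, and app queries.
    raw_query: str                        # text, or a speech-to-text transcript
    source: str                           # "web", "voice", or "app"
    language_hint: Optional[str] = None   # from browser/device settings
    geo: Optional[tuple] = None           # (lat, lon), if the user permits
    device: str = "desktop"               # drives mobile optimization
    session_history: list = field(default_factory=list)

req = SearchRequest(raw_query="weather today", source="voice", device="mobile")
```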
Every incoming query first passes through the query understanding engine:

| Subsystem | Purpose |
|---|---|
| Spell Correction | “Did you mean…?” |
| Query Rewriting | Normalize synonyms, plural/singular |
| NLP Parsing | Named entity recognition (NER), POS tagging |
| Intent Classifier | Is it navigational, transactional, informational? |
| Query Expansion | Expand query with similar/related terms |
| Language Detection | Multilingual support |
| Contextual Signals | Based on user history, location, device |
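A few of these stages can be sketched end to end. The pass below does normalization, naive dictionary-based spell correction, and synonym expansion; the real system uses ML models (BERT, MUM) rather than hard-coded tables, so the dictionaries here are stand-ins:

```python
# Toy query-understanding pass: normalize, spell-correct, expand.
# SPELL and SYNONYMS are tiny illustrative tables, not real data.

SPELL = {"recieve": "receive", "teh": "the"}
SYNONYMS = {"laptop": ["notebook"], "cheap": ["affordable", "budget"]}

def understand(query: str) -> dict:
    tokens = query.lower().split()
    corrected = [SPELL.get(t, t) for t in tokens]       # "Did you mean...?"
    expanded = []
    for t in corrected:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))            # query expansion
    return {
        "original": query,
        "corrected": " ".join(corrected),
        "expanded_terms": expanded,
    }

result = understand("Cheap teh laptop")
```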
Each index shard returns candidate documents with preliminary scores; the ranking engine then blends many signals:
| Rank Signal | Source |
|---|---|
| Textual Relevance | BM25, TF-IDF |
| Link Analysis | PageRank |
| Freshness | Recency of updates |
| User Behavior | Click-through rate (CTR), dwell time |
| BERT Embeddings | For semantic matching |
| Location, Device | Geo-targeting, mobile optimization |
| Personalization | Past searches, history, interests |
| Spam/Quality Score | Heuristics, ML filters (Panda, Penguin) |
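Production ranking uses learned models over these signals, but a weighted linear blend illustrates how heterogeneous signals become one score. The weights and documents below are made up for this sketch, and each signal is assumed pre-normalized to [0, 1]:

```python
# Blending rank signals (textual relevance, link analysis, freshness,
# user behavior) into a single score. Weights are illustrative only.

WEIGHTS = {"bm25": 0.4, "pagerank": 0.3, "freshness": 0.1, "ctr": 0.2}

def combined_score(signals: dict) -> float:
    # Missing signals default to 0; all inputs assumed in [0, 1].
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

docs = {
    "doc_a": {"bm25": 0.9, "pagerank": 0.2, "freshness": 0.5, "ctr": 0.4},
    "doc_b": {"bm25": 0.6, "pagerank": 0.8, "freshness": 0.9, "ctr": 0.7},
}
ranked = sorted(docs, key=lambda d: combined_score(docs[d]), reverse=True)
```

Here `doc_b` wins despite weaker textual relevance, because link authority, freshness, and click behavior outweigh it under these weights.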
On the offline side, the web crawler continuously discovers and fetches pages. Its main functions:
| Function | Description |
|---|---|
| URL Frontier | Queue of URLs to fetch (BFS + heuristics) |
| Politeness Layer | Obeys robots.txt, rate-limits domains |
| Fetcher | Downloads HTML, JS, CSS, PDF, etc. |
| Parser | Extracts text, links, canonical URLs |
| Duplicate Detection | Shingling + MinHash |
| Content Classifier | NSFW, spam, quality, topic |
| Storage | Crawled content lands in GFS / Colossus (Google's distributed file systems) |
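The duplicate-detection row deserves a closer look, since shingling + MinHash is what makes near-duplicate detection tractable at web scale: each document becomes a set of k-word shingles, and a fixed-size MinHash signature approximates Jaccard similarity between those sets without comparing them directly. A minimal sketch:

```python
import hashlib

# Near-duplicate detection via shingling + MinHash.
# Signatures are compared instead of full shingle sets, so billions of
# pages can be bucketed cheaply (in practice via LSH on the signatures).

def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    # Simulate num_hashes hash functions by salting one hash with a seed.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
c = minhash_signature(shingles("completely different page about search engines"))
```

Two pages differing in one word produce signatures that agree on most positions; unrelated pages agree on almost none.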
The indexing pipeline consumes crawled content and produces the inverted index (term → posting lists with document IDs and positions). It also computes link-analysis signals such as PageRank over the extracted link graph.
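The inverted-index build maps naturally onto MapReduce: the map phase emits (term, posting) pairs per document, and the reduce phase groups them by term. This in-memory sketch mirrors that data flow on two toy documents:

```python
from collections import defaultdict

# Building an inverted index as in-memory map/reduce phases.
# The production version is a sharded MapReduce pipeline over petabytes;
# the shape of the data flow is the same.

def map_phase(doc_id: str, text: str):
    # Emit (term, (doc_id, position)) pairs for one document.
    for pos, term in enumerate(text.lower().split()):
        yield term, (doc_id, pos)

def reduce_phase(pairs):
    # Group postings: term -> {doc_id: [positions]}.
    index = defaultdict(lambda: defaultdict(list))
    for term, (doc_id, pos) in pairs:
        index[term][doc_id].append(pos)
    return {t: dict(d) for t, d in index.items()}

docs = {
    "d1": "search engines crawl the web",
    "d2": "the web is large",
}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
```

Positions are kept so phrase queries ("the web") can verify adjacency at query time.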
Serving is replicated across datacenters and edge locations:

| Component | Strategy |
|---|---|
| Index Servers | Sharded by term hash, replicated globally |
| Search Frontends | Edge locations close to user |
| CDNs | Cache hot queries and featured snippets |
| Fresh Index Nodes | Continuously update recently crawled content |
| Knowledge Graph Nodes | Serve fact-based entities and relationships |
| Ad Auctions | Per-region real-time bidders |
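"Sharded by term hash" is typically implemented with consistent hashing, so that adding or removing an index server only remaps a small slice of terms rather than reshuffling everything. A minimal ring with virtual nodes (shard names are placeholders):

```python
import hashlib
from bisect import bisect

# Consistent-hash ring for routing terms to index shards.
# Virtual nodes (vnodes) smooth out the load across shards.

class HashRing:
    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, term: str) -> str:
        # First ring position clockwise from the term's hash owns it.
        i = bisect(self.keys, self._hash(term)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["shard-0", "shard-1", "shard-2"])
owner = ring.shard_for("pagerank")
```

Routing is deterministic, so every frontend independently computes the same shard for a given term.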
Security and trust concerns are handled throughout the stack:

| Concern | Protection |
|---|---|
| Spammy Pages | Classifier filters during crawl & rank |
| Malicious Sites | Safe Browsing detection |
| Query Spam | Rate limiting, abuse heuristics |
| Click Fraud | Ads click models detect anomalies |
| HTTPS Preference | Higher rank for HTTPS |
| Data Privacy | Personalization opt-out, anonymization |
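The "query spam" row usually means per-client rate limiting, and a token bucket is the standard building block: each client gets a refilling budget of requests, so short bursts are allowed but sustained abuse is throttled. A minimal sketch (rate and capacity are illustrative):

```python
import time

# Token-bucket rate limiter for query spam protection.
# Each client would get its own bucket, keyed by IP or account.

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens refilled per second
        self.capacity = capacity          # max burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=5.0, capacity=10)
results = [bucket.allow() for _ in range(15)]  # a burst of 15 requests
```

The first 10 requests in the burst pass; the rest are rejected until tokens refill.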
+------------------+ +---------------------+ +-----------------------------+
| User Query | --> | Search Frontend | --> | Query Understanding Engine |
+------------------+ +---------------------+ +-------------+---------------+
|
v
+-----------------------+
| Index Lookup | <---> [Inverted Index Shards]
+-----------------------+
|
v
+-------------------------------+
| Ranking Engine (ML + Heur) |
+-------------------------------+
|
+----------------------+---------+---------+------------------------+
| | | |
v v v v
[Organic Results] [Knowledge Panel] [Ad Engine] [Featured Snippets]
\___________________ Final Page Renderer _____________________/
|
v
+--------------------+
| Return to Browser |
+--------------------+
The serving path above is fed by the offline pipelines:

    [Offline]               [Offline]             [Offline]            [Online]
+----------------+  +---------------------+  +-----------------+  +--------------------+
|   Web Crawler  |  | Indexing Pipelines  |  | Rank Model Train|  | Fresh Index Updater|
+----------------+  +---------------------+  +-----------------+  +--------------------+