🔍 DeepGit 2.0 — ColBERT-Powered, Hardware-Aware & Ready to Dig

DeepGit 2.0 elevates repository discovery with ColBERT embeddings, hardware-aware filtering, and faster re-ranking


Community Article · Published April 18, 2025 · by zamal

GitHub’s great… until you actually have to find something.
Stars are a popularity contest, keywords are brittle, and half the repos you open can’t even run on your machine.
DeepGit 2.0 fixes that by treating GitHub like a research corpus instead of a social feed.


DeepGit 2.0 is a LangGraph-based agentic workflow for deep research across GitHub repositories. It searches, analyzes, and ranks repositories based on user intent, and it can surface lesser-known but highly relevant tools. By combining dense retrieval built on ColBERT v2 embeddings, cross-encoder re-ranking, hardware-aware dependency filtering, and repository activity analysis, DeepGit 2.0 offers a unified, open-source platform for repository discovery.


DeepGit Dashboard Screenshot

🧩 What Makes DeepGit Different?

| Pain on vanilla GitHub | DeepGit’s antidote |
| --- | --- |
| Infinite scrolling through star-inflated, outdated projects | ColBERT v2 semantic retrieval – token-level MaxSim pulls conceptually relevant repos, not just fuzzy keyword hits |
| README looks good… until pip install dies | Hardware-aware dependency filter – agent reads requirements.txt/pyproject.toml and drops repos that need a GPU |
| One metric (stars) ≠ quality | Multi-factor ranking – cross-encoder similarity, code-quality heuristics, commit cadence & community health blended |
| Time sink: clicking, reading, guessing | Tabulated results with similarity %, hardware badge, and one-line justification – decide in seconds |
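
If "token-level MaxSim" is new to you, here is a minimal pure-Python sketch of the idea: every query token picks its best-matching document token, and those maxima are summed. The toy 2-dimensional vectors and the name `maxsim_score` are illustrative assumptions, not DeepGit's actual code; a real ColBERT model produces one normalized embedding per token.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over the whole query."""
    if not query_tokens or not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, not real ColBERT output).
query = [[1.0, 0.0], [0.0, 1.0]]
readme_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # covers both query tokens
readme_b = [[0.9, 0.1], [0.8, 0.2]]              # covers only the first one

scores = {
    "repo_a": maxsim_score(query, readme_a),
    "repo_b": maxsim_score(query, readme_b),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because each query token is matched on its own, a README that covers every part of the query tends to outrank one that matches a single term very strongly.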

🚀 What’s New in 2.0?

DeepGit 2.0

| Upgrade | Why it matters |
| --- | --- |
| ⚛ ColBERT-v2 embeddings | Late-interaction vectors capture phrase-level context; surfaces hidden gems that single-vector models miss |
| 🔩 Hardware-aware filter | Add cpu-only, low-memory or mobile to your query – the agent prunes heavyweight repos automatically |
| ⚡ Faster cross-encoder | MiniLM-L-6-v2 keeps passage-level accuracy while chopping latency |
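
To make the hardware keywords concrete, here is a rough sketch of how a query could be scanned for them. The three keywords come from the table above. The function name `detect_hardware_constraints` and the regex matching are assumptions for illustration, and DeepGit may well leave this step to the LLM instead.

```python
import re
from typing import List

# Keywords named in the article; the matching strategy below is an assumption.
HARDWARE_KEYWORDS = ("cpu-only", "low-memory", "mobile")


def detect_hardware_constraints(query: str) -> List[str]:
    """Return the hardware keywords found in the query, in a fixed order."""
    lowered = query.lower()
    found = []
    for keyword in HARDWARE_KEYWORDS:
        # Accept "cpu-only", "cpu only", or "cpu_only" as the same constraint.
        parts = [re.escape(part) for part in keyword.split("-")]
        pattern = r"\b" + r"[-_ ]".join(parts) + r"\b"
        if re.search(pattern, lowered):
            found.append(keyword)
    return found


constraints = detect_hardware_constraints("Fast Rust JSON parser that runs on cpu-only")
print(constraints)
```

The detected constraints can then travel with the query so that later stages know which repositories to prune.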

🛠 Inside the Agentic Pipeline

Query: “Fast Rust JSON parser that runs on cpu-only”

| Stage | Behind the curtain |
| --- | --- |
| 1. Query Expansion | LLM rewrites to json-parser:rust:target-cpu |
| 2. Hardware Detection | “cpu-only” recorded as a constraint |
| 3. ColBERT Retrieval | 280 repos scored via MaxSim over README & docs |
| 4. Cross-Encoder Re-rank | Top-K rescored → 60 remain |
| 5. Dependency Filter | Model reads Cargo.toml & drops crates requiring CUDA |
| 6. Insight Merge | Adds stars, forks, issue velocity, code smells |
| 7. Output | Table with similarity %, CE-score, and ✅ Runs on cpu-only badge |
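
The later stages (filter, merge, rank) can be pictured with a small sketch. Everything here is a hypothetical stand-in: the `Candidate` fields, the 0.8/0.2 weights, and the log-damped star signal are assumptions chosen for illustration, not DeepGit's actual scoring formula.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    ce_score: float    # cross-encoder relevance, assumed to lie in [0, 1]
    stars: int
    runs_on_cpu: bool  # outcome of the dependency filter


def blended_score(c: Candidate, w_relevance: float = 0.8, w_stars: float = 0.2) -> float:
    """Hypothetical blend: relevance dominates, stars are log-damped and capped."""
    star_signal = min(math.log10(c.stars + 1) / 5.0, 1.0)  # about 100k stars maps to 1.0
    return w_relevance * c.ce_score + w_stars * star_signal


def rank(candidates: List[Candidate], cpu_only: bool, top_k: int) -> List[Candidate]:
    """Drop repos that violate the hardware constraint, then sort by the blend."""
    kept = [c for c in candidates if c.runs_on_cpu or not cpu_only]
    return sorted(kept, key=blended_score, reverse=True)[:top_k]


candidates = [
    Candidate("popular-gpu-parser", ce_score=0.90, stars=50_000, runs_on_cpu=False),
    Candidate("small-cpu-parser", ce_score=0.85, stars=120, runs_on_cpu=True),
    Candidate("loosely-related", ce_score=0.30, stars=9_000, runs_on_cpu=True),
]
results = rank(candidates, cpu_only=True, top_k=2)
print([c.name for c in results])
```

In this toy run the heavily starred GPU project is removed by the constraint, and the small but relevant CPU project ranks first even though it has far fewer stars.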

DeepGit Results Screenshot

🔬 Technical Highlights

  • LangGraph orchestration – each tool is a node; loops until convergence
  • ColBERT-v2 – pulled from colbert-ir/colbertv2.0, runs CPU or GPU
  • Cross-Encoder – cross-encoder/ms-marco-MiniLM-L-6-v2 for precise re-ranking
  • Dependency reasoning – the agent asks “Can this dependency list run on my hardware?” and acts accordingly
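
As a rough illustration of the dependency-reasoning step, the sketch below replaces the agent's LLM judgment with a simple lookup. The `GPU_ONLY_PACKAGES` set is an illustrative, non-exhaustive assumption, and the helper names are hypothetical; the real agent reasons over the dependency list rather than matching against a fixed list.

```python
import re
from typing import List

# Illustrative, non-exhaustive set of packages that typically need an NVIDIA GPU.
GPU_ONLY_PACKAGES = {
    "pycuda", "cupy", "tensorrt", "onnxruntime-gpu", "faiss-gpu", "flash-attn",
}


def requirement_names(requirements_text: str) -> List[str]:
    """Extract normalized package names from requirements.txt content."""
    names = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Treat runs of "-", "_" and "." as equivalent and lowercase the name.
            names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def gpu_blockers(requirements_text: str) -> List[str]:
    """Return the dependencies that would rule a repo out for cpu-only use."""
    return [n for n in requirement_names(requirements_text) if n in GPU_ONLY_PACKAGES]


sample = "numpy>=1.24\nonnxruntime_gpu==1.17.0  # inference\n-r extra.txt\nrequests\n"
blockers = gpu_blockers(sample)
print(blockers)
```

An empty result means nothing in the list obviously demands a GPU, which is weaker than proof that the project runs well on CPU.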

🎯 Goals

  • Uncover Hidden Gems: Surface powerful but under-the-radar open-source tools, now with hardware spec filtering.
  • Empower Research: Build an intelligent discovery layer over GitHub tailored for research-focused developers.
  • Promote Open Innovation: Open-source the entire workflow to foster transparency and collaboration.

🧪 Try It Yourself

Zero-GPU demo
👉 Hugging Face Space – DeepGit-lite

Full local run

git clone https://github.com/zamalali/DeepGit.git
cd DeepGit
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
export GITHUB_API_KEY="<your_token>"   # replace <your_token> with your GitHub token
python app.py