🔍 DeepGit 2.0 — ColBERT-Powered, Hardware-Aware & Ready to Dig


DeepGit 2.0 elevates repository discovery with ColBERT embeddings, hardware-aware filtering, and faster re-ranking

Community Article · Published April 18, 2025 · by zamal

GitHub’s great… until you actually have to find something.
Stars are a popularity contest, keywords are brittle, and half the repos you open can’t even run on your machine.
DeepGit 2.0 fixes that by treating GitHub like a research corpus instead of a social feed.

DeepGit 2.0 is a LangGraph-based agentic workflow for deep research across GitHub repositories. It searches, analyzes, and ranks repositories by user intent, surfacing lesser-known but highly relevant tools along the way. It combines hybrid dense retrieval using ColBERT v2 embeddings, cross-encoder re-ranking, hardware-aware dependency filtering, and repository activity analysis in a single open-source platform for repository discovery.

DeepGit Dashboard Screenshot

🧩 What Makes DeepGit Different?

| Pain on vanilla GitHub | DeepGit’s antidote |
| --- | --- |
| Infinite scrolling through star-inflated, outdated projects | ColBERT v2 semantic retrieval – token-level MaxSim pulls conceptually relevant repos, not just fuzzy keyword hits |
| README looks good… until `pip install` dies | Hardware-aware dependency filter – agent reads `requirements.txt`/`pyproject.toml` and drops repos that can’t run on the hardware you specify (for example, GPU-only repos when you ask for `cpu-only`) |
| One metric (stars) ≠ quality | Multi-factor ranking – cross-encoder similarity, code-quality heuristics, commit cadence & community health blended into one score |
| Time sink: clicking, reading, guessing | Tabulated results with similarity %, hardware badge, and one-line justification – decide in seconds |
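
To make the "multi-factor ranking" row concrete, here is a minimal sketch of blending several signals into one score. The `RepoSignals` fields, the `WEIGHTS` values, and the assumption that every signal is already scaled to [0, 1] are all illustrative; DeepGit's actual weighting may look quite different.

```python
from dataclasses import dataclass


@dataclass
class RepoSignals:
    """Hypothetical per-repo signals, each assumed to be pre-scaled to [0, 1]."""
    name: str
    cross_encoder: float   # semantic match between query and repo docs
    code_quality: float    # heuristic code-quality score
    commit_cadence: float  # how recently and regularly the repo is updated
    community: float       # stars, forks, issue health


# Illustrative weights only (they sum to 1.0); not taken from DeepGit.
WEIGHTS = {
    "cross_encoder": 0.5,
    "code_quality": 0.2,
    "commit_cadence": 0.15,
    "community": 0.15,
}


def blended_score(repo: RepoSignals) -> float:
    """Weighted sum of all signals."""
    return sum(weight * getattr(repo, field) for field, weight in WEIGHTS.items())


def rank(repos):
    """Return repos sorted by blended score, best first."""
    return sorted(repos, key=blended_score, reverse=True)


popular = RepoSignals("popular-but-off-topic", 0.3, 0.6, 0.5, 1.0)
niche = RepoSignals("niche-but-relevant", 0.9, 0.7, 0.6, 0.2)
for repo in rank([popular, niche]):
    print(f"{repo.name}: {blended_score(repo):.3f}")
```

With these made-up weights, the well-matched niche repo outranks the heavily starred one, which is the behavior the table describes.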

🚀 What’s New in 2.0?

| Upgrade | Why it matters |
| --- | --- |
| ⚛ ColBERT-v2 embeddings | Late-interaction vectors capture phrase-level context; surfaces hidden gems that single-vector models miss |
| 🔩 Hardware-aware filter | Add `cpu-only`, `low-memory` or `mobile` to your query – the agent prunes heavyweight repos automatically |
| ⚡ Faster cross-encoder | `MiniLM-L-6-v2` keeps passage-level accuracy while chopping latency |
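
For readers new to late interaction, the MaxSim idea can be sketched in a few lines: each query token is matched to its most similar document token, and those best matches are summed. The toy 2-D vectors below stand in for real ColBERT token embeddings, and the helper names are hypothetical; this is plain Python for illustration, not the ColBERT library's API.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token take the best-matching
    document token, then sum those maxima. Assumes doc_vecs is non-empty."""
    q = [_normalize(v) for v in query_vecs]
    d = [_normalize(v) for v in doc_vecs]
    return sum(max(_dot(qv, dv) for dv in d) for qv in q)


# Toy token embeddings (assumed for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # one well-matching token per query token
doc_b = [[1.0, 1.0]]               # a single "averaged" token

print(maxsim(query, doc_a))
print(maxsim(query, doc_b))
```

`doc_a` scores higher because every query token finds its own close match, whereas `doc_b` behaves like a single pooled vector and only partially matches each token.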

🛠 Inside the Agentic Pipeline

Query: “Fast Rust JSON parser that runs on cpu-only”

| Stage | Behind the curtain |
| --- | --- |
| 1. Query Expansion | LLM rewrites to `json-parser:rust:target-cpu` |
| 2. Hardware Detection | “cpu-only” recorded as a constraint |
| 3. ColBERT Retrieval | 280 repos scored via MaxSim over README & docs |
| 4. Cross-Encoder Re-rank | Top-K rescored → 60 remain |
| 5. Dependency Filter | Model reads `Cargo.toml` & drops crates requiring CUDA |
| 6. Insight Merge | Adds stars, forks, issue velocity, code smells |
| 7. Output | Table with similarity %, CE-score, and ✅ Runs on cpu-only badge |
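
Stage 2 above can be approximated without an LLM at all. The sketch below pulls the hardware keywords out of the query and returns them as constraints. The three keywords come from this article; the regex matching and the `split_query` helper are assumptions made for illustration, not DeepGit's implementation.

```python
import re

# Keywords mentioned in the article; the matching rules below are assumed.
HARDWARE_KEYWORDS = ("cpu-only", "low-memory", "mobile")


def split_query(query):
    """Return (query text without hardware keywords, set of constraints found)."""
    constraints = set()
    cleaned = query
    for keyword in HARDWARE_KEYWORDS:
        # Match the keyword only as a standalone token, so "automobile" or
        # "mobile-first" do not trigger the "mobile" constraint.
        pattern = re.compile(
            r"(?<![\w-])" + re.escape(keyword) + r"(?![\w-])", re.IGNORECASE
        )
        if pattern.search(cleaned):
            constraints.add(keyword)
            cleaned = pattern.sub(" ", cleaned)
    # Collapse the whitespace left behind by removed keywords.
    return " ".join(cleaned.split()), constraints


text, constraints = split_query("Fast Rust JSON parser that runs on cpu-only")
print(text)
print(constraints)
```

Keeping the constraint separate from the search text means later stages, such as the dependency filter, can check it explicitly instead of hoping the retriever picks it up.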

DeepGit Results Screenshot

🔬 Technical Highlights

LangGraph orchestration – each tool is a node; loops until convergence
ColBERT-v2 – pulled from colbert-ir/colbertv2.0, runs CPU or GPU
Cross-Encoder – cross-encoder/ms-marco-MiniLM-L-6-v2 for precise re-ranking
Dependency reasoning – the agent asks “Can this dependency list run on my hardware?” and acts accordingly
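
The dependency-reasoning step is described as an agent decision. As a rough, rule-based stand-in, the sketch below scans a `requirements.txt` body for packages on a small denylist. The `GPU_ONLY_PACKAGES` set and both helper functions are hypothetical, and the parsing is deliberately simplified (it skips option lines and ignores extras, markers, and URLs).

```python
import re

# Hypothetical, illustrative denylist of packages that generally need an NVIDIA GPU.
GPU_ONLY_PACKAGES = {"pycuda", "tensorrt", "flash-attn"}


def parse_requirement_names(text):
    """Extract normalized package names from requirements.txt-style text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):    # skip blanks and options like -r
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Normalize as package indexes do: lowercase, runs of -_. become "-".
            names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def runs_on_cpu_only(requirements_text):
    """Return (ok, blocked): ok is False if any denylisted package is required."""
    blocked = sorted(set(parse_requirement_names(requirements_text)) & GPU_ONLY_PACKAGES)
    return (not blocked, blocked)


print(runs_on_cpu_only("numpy>=1.24\nrequests  # http client\n"))
print(runs_on_cpu_only("numpy\nFlash_Attn>=2.0\npycuda==2024.1\n-r extra.txt\n"))
```

A static list like this will miss packages that ship both CPU and GPU builds, which is presumably why the article delegates the final call to the agent rather than to fixed rules.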

🎯 Goals

Uncover Hidden Gems: Surface powerful but under-the-radar open-source tools, now with hardware spec filtering.
Empower Research: Build an intelligent discovery layer over GitHub tailored for research-focused developers.
Promote Open Innovation: Open-source the entire workflow to foster transparency and collaboration.

🧪 Try It Yourself

Zero-GPU demo
👉 Hugging Face Space – DeepGit-lite

Full local run

git clone https://github.com/zamalali/DeepGit.git
cd DeepGit
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
export GITHUB_API_KEY="your_token"   # replace your_token with your GitHub token
python app.py