🔍 DeepGit 2.0 — ColBERT-Powered, Hardware-Aware & Ready to Dig
DeepGit 2.0 elevates repository discovery with ColBERT embeddings, hardware-aware filtering, and faster re-ranking
Community Article · Published April 18, 2025 · by zamal
GitHub’s great… until you actually have to find something.
Stars are a popularity contest, keywords are brittle, and half the repos you open can’t even run on your machine.
DeepGit 2.0 fixes that by treating GitHub like a research corpus instead of a social feed.
DeepGit 2.0 is a LangGraph-based agentic workflow for deep research across GitHub repositories. It searches, analyzes, and ranks repositories by user intent, surfacing lesser-known but highly relevant tools along the way. It combines hybrid dense retrieval using ColBERT v2 embeddings, cross-encoder re-ranking, hardware-aware dependency filtering, and activity analysis in a single open-source platform for repository discovery.
🧩 What Makes DeepGit Different?
Pain on vanilla GitHub | DeepGit’s antidote |
---|---|
Infinite scrolling through star-inflated, outdated projects | ColBERT v2 semantic retrieval – token-level MaxSim pulls conceptually relevant repos, not just fuzzy keyword hits |
README looks good… until pip install dies | Hardware-aware dependency filter – agent reads requirements.txt / pyproject.toml and drops repos that need a GPU |
One metric (stars) ≠ quality | Multi-factor ranking – cross-encoder similarity, code-quality heuristics, commit cadence & community health blended |
Time sink: clicking, reading, guessing | Tabulated results with similarity %, hardware badge, and one-line justification – decide in seconds |
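To make "token-level MaxSim" concrete, here is a minimal sketch of late-interaction scoring in plain Python. It assumes the query and each README have already been encoded into per-token vectors; the toy 2-D vectors and the function name `maxsim_score` are illustrative only and are not taken from the DeepGit codebase.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """For each query token, take its best-matching document token, then sum."""
    if not query_tokens or not doc_tokens:
        return 0.0
    q = [_normalize(v) for v in query_tokens]
    d = [_normalize(v) for v in doc_tokens]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy data: two query tokens, two candidate "READMEs" (hypothetical embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
readme_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
readme_b = [[1.0, 1.0]]                          # one token that only partly matches both

print(maxsim_score(query, readme_a), maxsim_score(query, readme_b))
```

Because every query token is matched independently, a README that covers each concept in the query scores higher than one that is only loosely related overall. That is the property a single pooled vector tends to lose.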
🚀 What’s New in 2.0?
Upgrade | Why it matters |
---|---|
⚛ ColBERT-v2 embeddings | Late-interaction vectors capture phrase-level context; surfaces hidden gems that single-vector models miss |
🔩 Hardware-aware filter | Add cpu-only, low-memory, or mobile to your query – the agent prunes heavyweight repos automatically |
⚡ Faster cross-encoder | MiniLM-L-6-v2 keeps passage-level accuracy while chopping latency |
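As a rough illustration of how hardware keywords could be pulled out of a query before retrieval, the sketch below uses a simple regex pass. The three keywords come from the table above; the function name and the choice of regex matching are assumptions, since in DeepGit the agent may well handle this step with an LLM.

```python
import re

# Keywords named in the article; treated here as a fixed vocabulary (an assumption).
HARDWARE_KEYWORDS = ("cpu-only", "low-memory", "mobile")


def extract_hardware_constraints(query: str) -> tuple[str, list[str]]:
    """Return the query with hardware keywords removed, plus the keywords found."""
    found: list[str] = []
    cleaned = query
    for kw in HARDWARE_KEYWORDS:
        # Match the keyword only as a standalone term, not inside a longer word.
        pattern = re.compile(rf"(?<![\w-]){re.escape(kw)}(?![\w-])", re.IGNORECASE)
        if pattern.search(cleaned):
            found.append(kw)
            cleaned = pattern.sub(" ", cleaned)
    # Collapse any whitespace left behind by the removals.
    return " ".join(cleaned.split()), found


text, constraints = extract_hardware_constraints(
    "Fast Rust JSON parser that runs on cpu-only"
)
print(text, constraints)
```

Separating the constraint from the topical part of the query keeps the semantic search focused on what the repo does, while the constraint is carried forward to the dependency filter.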
🛠 Inside the Agentic Pipeline
Query: “Fast Rust JSON parser that runs on cpu-only”
Stage | Behind the curtain |
---|---|
1. Query Expansion | LLM rewrites to json-parser:rust:target-cpu |
2. Hardware Detection | “cpu-only” recorded as a constraint |
3. ColBERT Retrieval | 280 repos scored via MaxSim over README & docs |
4. Cross-Encoder Re-rank | Top-K rescored → 60 remain |
5. Dependency Filter | Model reads Cargo.toml & drops crates requiring CUDA |
6. Insight Merge | Adds stars, forks, issue velocity, code smells |
7. Output | Table with similarity %, CE-score, and ✅ Runs on cpu-only badge |
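Stage 5 can be approximated with a keyword scan over a manifest file. The sketch below is a deliberately crude stand-in: the article describes a model reading the dependency list, whereas this version only checks dependency names against a hypothetical marker list (`GPU_MARKERS`), and its parsing handles only simple `name = "version"` and `name==version` lines.

```python
import re

# Hypothetical markers: substrings that usually imply an NVIDIA GPU toolchain.
GPU_MARKERS = ("cuda", "cudnn", "cupy", "tensorrt", "nvidia")

_NAME = re.compile(r"^\s*([A-Za-z0-9_.\-]+)")


def dependency_names(manifest: str) -> list[str]:
    """Extract dependency names from requirements.txt or simple Cargo.toml text."""
    names: list[str] = []
    in_deps = True  # requirements.txt has no section headers
    for raw in manifest.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line.startswith("["):
            # TOML section header: only read keys under *dependencies sections.
            in_deps = "dependencies" in line.lower()
            continue
        match = _NAME.match(line)
        if in_deps and match:
            names.append(match.group(1).lower())
    return names


def runs_on_cpu_only(manifest: str) -> bool:
    """True if no dependency name contains a GPU marker."""
    return not any(
        marker in name for name in dependency_names(manifest) for marker in GPU_MARKERS
    )


cargo_gpu = '[package]\nname = "fastjson"\n\n[dependencies]\nserde = "1.0"\ncudarc = "0.9"\n'
cargo_cpu = '[package]\nname = "fastjson"\n\n[dependencies]\nserde = "1.0"\nmemchr = "2"\n'
print(runs_on_cpu_only(cargo_gpu), runs_on_cpu_only(cargo_cpu))
```

Name matching will miss packages that are GPU-optional or GPU-only without saying so in their name, which is presumably why the pipeline delegates this judgment to a model rather than a fixed list.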
🔬 Technical Highlights
- LangGraph orchestration – each tool is a node; loops until convergence
- ColBERT-v2 – pulled from `colbert-ir/colbertv2.0`, runs on CPU or GPU
- Cross-Encoder – `cross-encoder/ms-marco-MiniLM-L-6-v2` for precise re-ranking
- Dependency reasoning – the agent asks “Can this dependency list run on my hardware?” and acts accordingly
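The re-ranking step can be sketched as a function that scores (query, document) pairs and keeps the top results. The scorer is pluggable so the example runs without downloading a model: `overlap_scorer` is a toy stand-in, while `cross_encoder_scorer` shows how the model named above could be loaded through the sentence-transformers `CrossEncoder` class. How DeepGit wires this up internally is not shown in the article, so treat the structure as an assumption.

```python
from typing import Callable, Sequence

# A scorer takes (query, doc) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str, docs: Sequence[str], score_pairs: PairScorer, top_k: int
) -> list[tuple[str, float]]:
    """Score every candidate against the query and keep the top_k best."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(
        zip(docs, (float(s) for s in scores)), key=lambda item: item[1], reverse=True
    )
    return ranked[:top_k]


def cross_encoder_scorer() -> PairScorer:
    """Load the real model lazily; requires sentence-transformers to be installed."""
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    return model.predict


def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in: fraction of query words that also appear in the document."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / len(q_words) if q_words else 0.0)
    return scores


candidates = [
    "Python web framework",
    "JSON schema validator",
    "A fast JSON parser written in Rust",
]
top = rerank("rust json parser", candidates, overlap_scorer, top_k=2)
print(top)
```

To use the actual cross-encoder, pass `cross_encoder_scorer()` in place of `overlap_scorer`. The first call downloads the model weights.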
🎯 Goals
- Uncover Hidden Gems: Surface powerful but under-the-radar open-source tools, now with hardware spec filtering.
- Empower Research: Build an intelligent discovery layer over GitHub tailored for research-focused developers.
- Promote Open Innovation: Open-source the entire workflow to foster transparency and collaboration.
🧪 Try It Yourself
Zero-GPU demo
👉 Hugging Face Space – DeepGit-lite
Full local run
git clone https://github.com/zamalali/DeepGit.git
cd DeepGit
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
export GITHUB_API_KEY="your_token" # replace your_token with your GitHub token
python app.py