Sourabh Sonker — Data Scientist

// about

The Story

I graduated from Delhi Technological University in 2024 with a B.Tech in Mechanical Engineering (CGPA: 7.86). In my final year, I made a deliberate call to skip campus placements and pivot to Data Science — betting on skills over credentials.

Since then, I've built 5 production-deployed projects covering the full modern DS stack: classical ML with explainability, NLP with transformer models, time series forecasting, LLM-powered RAG systems, and an MLOps pipeline with Docker and CI/CD. Every project is live, measurable, and built to production standards.

My skills are domain-agnostic: the same tools that power bank churn prediction also drive retention modelling in EdTech, and financial forecasting techniques apply to any sequential data problem. I have a particular interest in Finance and FinTech, but I bring full-stack DS capability to any data-intensive product.

Education

B.Tech Mech. Engg. · DTU · 2024

CGPA

7.81 / 10.0

Location

Delhi, India · Open to Remote · PAN India

Domain Interest

Finance · FinTech · Analytics

Skills Apply To

Any data-intensive domain

Availability

Immediately Available

// projects

Featured Projects

01 / 05

BEGINNER · ML · CLASSIFICATION

Bank Customer Churn Prediction & Explainability

Full ML lifecycle on 10,000 bank customer records — imbalanced data handling, Bayesian hyperparameter tuning, and SHAP-powered per-customer explainability. Business framing throughout: quantifying the real cost of missing a churn event.

86.9%

XGBoost AUC-ROC

+3.6 pts

Optuna gain

4:1

Imbalance handled

5

Retention levers found

Benchmarked 3 models: XGBoost (86.9%) · RF (86.2%) · LR (77.2%) with stratified evaluation; Optuna Bayesian tuning (50 trials) added +3.6 pts over untuned baseline
Engineered 4 domain features — products_per_tenure ranked #2 in SHAP importance, capturing the non-linear 3–4 product churn cliff
SHAP TreeExplainer identified top 5 levers: age 41–60, Germany geography, inactive membership, mono-product high-balance holders, credit score <600
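
The business framing above (the cost of missing a churn event) can be illustrated as a decision-threshold choice. This is a minimal sketch, not the project's code: the cost figures, the toy data, and the helper names `expected_cost` and `best_threshold` are hypothetical and exist only for illustration.

```python
# Hypothetical costs: a missed churner (false negative) is assumed to cost far
# more than an unnecessary retention offer (false positive).
COST_FN = 500.0  # assumed lost value per missed churner
COST_FP = 50.0   # assumed cost per wasted retention offer


def expected_cost(y_true, y_prob, threshold):
    """Total cost of the errors made when flagging customers with y_prob >= threshold."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if label == 1 and not flagged:
            cost += COST_FN  # churner we failed to flag
        elif label == 0 and flagged:
            cost += COST_FP  # loyal customer flagged by mistake
    return cost


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda t: expected_cost(y_true, y_prob, t))


# Toy data with a 4:1 class imbalance (1 = churned); probabilities are made up.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_prob = [0.10, 0.20, 0.35, 0.05, 0.45, 0.15, 0.55, 0.25, 0.30, 0.80]
thresholds = [0.3, 0.4, 0.5, 0.6]

chosen = best_threshold(y_true, y_prob, thresholds)
print(f"Lowest-cost threshold: {chosen}")
```

With asymmetric costs like these, the lowest-cost threshold tends to sit below the default 0.5, because catching one extra churner outweighs several wasted offers.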

Python · XGBoost · scikit-learn · SMOTE · SHAP · Optuna · Streamlit

⌥ GitHub ↗ Live App

02 / 05

INTERMEDIATE · NLP · TRANSFORMERS

Financial News Sentiment Analysis + Stock Correlation

Finance-domain NLP using FinBERT — a BERT model pre-trained on financial text. Goes beyond accuracy: ablation studies, a caught evaluation bug, and a statistically honest sentiment-price correlation study across 5 tickers and 2 years of data.

75.1%

FinBERT accuracy

4.3×

Bearish F1 gain

5,842

Headlines

<1 sec

Tab load time

FinBERT zero-shot: 4.3× improvement in bearish detection (F1: 0.14 → 0.60) over TF-IDF+SVM; domain-aware preprocessing validated via ablation study
Identified and corrected a label-misalignment bug that inflated VADER baseline by 25.3 percentage points — caught through EDA notebook review
4-tab Streamlit dashboard; ~1hr cold-start resolved by precomputing FinBERT artifacts; yfinance rate-limiting fixed via precomputed CSVs
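
The bearish F1 figures above are per-class scores, and the 4.3× gain is the ratio of the two. The sketch below shows how such a score could be computed; it is an illustration under assumptions, not the project's evaluation code. The helper `f1_for_class`, the class names other than "bearish", and the toy labels are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for a single class, treating every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up) to exercise the function.
y_true = ["bearish", "bullish", "neutral", "bearish", "neutral", "bearish"]
y_pred = ["bearish", "bullish", "neutral", "neutral", "neutral", "bearish"]
toy_f1 = f1_for_class(y_true, y_pred, "bearish")

# Ratio of the two bearish F1 scores reported above (0.14 -> 0.60).
gain = 0.60 / 0.14
print(f"toy bearish F1 = {toy_f1:.2f}, reported gain = {gain:.1f}x")
```

Reporting the minority-class F1 alongside overall accuracy matters here because a model can score well on accuracy while rarely detecting bearish headlines.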

Python · FinBERT · HuggingFace · PyTorch · scikit-learn · yfinance · Plotly · Streamlit

⌥ GitHub ↗ Live App

03 / 05

INTERMEDIATE · TIME SERIES

Retail + NSE Stock Time Series Forecasting

Dual-domain forecasting — 3M+ rows of retail data plus live NSE stock prices. Every modelling decision is documented and justified: why SARIMA beat Prophet here, what the earthquake spike means, why stock MAPE is higher than retail MAPE.

8.8%

SARIMA MAPE

r = −0.47

Oil-sales corr.

3M+

Rows processed

10

NSE tickers live

SARIMA: 8.8% MAPE on 90-day retail horizon vs. Prophet's 12.1%; Prophet's holiday-modelling advantage is absent in the holiday-free Jun–Aug 2017 test window (documented with reasoning)
Macro signal: WTI oil price r = −0.47 with Ecuador grocery sales; April 2016 earthquake modelled as named Prophet event (2× normal Saturday demand)
Extended to live NSE stocks; ARIMA(0,1,1) order consistent with weak-form EMH — documented as insight, not buried; Streamlit app supporting 10 tickers
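
The two headline statistics above, MAPE and the Pearson correlation r, follow standard definitions. The sketch below computes both with the standard library; it is illustrative only. The function names `mape` and `pearson_r` and all numbers in the toy series are hypothetical, not project data.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Toy series (made up): three days of sales and their forecasts.
toy_mape = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])

# Toy series (made up): a perfectly inverse relationship gives r = -1.
toy_r = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])

print(f"MAPE = {toy_mape:.2f}%, r = {toy_r:.2f}")
```

Because MAPE divides by the actual value, it is unstable for series that approach zero, which is one reason to quote it alongside the forecast horizon and test window.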

Python · Statsmodels · pmdarima · Prophet · yfinance · Plotly · Streamlit

⌥ GitHub ↗ Live App

04 / 05

ADVANCED · MLOPS · CI/CD

House Price Prediction — End-to-End MLOps Pipeline

Production ML engineering from pipeline design to live deployment. A leak-proof sklearn pipeline, MLflow experiment tracking, FastAPI REST service, Docker container, and GitHub Actions CI/CD that auto-deploys on every push — with failing tests blocking deployment.

0.1209

XGBoost CV RMSLE

Models tracked

5-fold

Cross-validation

Live

API on Render

ColumnTransformer fit exclusively on training folds — prevents data leakage; MLflow tracking: XGBoost selected at CV RMSLE 0.1209 over LightGBM (0.1284), Ridge (0.1392)
FastAPI with Pydantic validation and Swagger UI; multi-stage Docker build; GitHub Actions CI/CD — tests → retrain → build → deploy; failing tests block deployment
Diagnosed and resolved 5 independent production issues including CI mock scoping and Docker image bloat
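
The leakage rule above (fit preprocessing on training folds only) and the RMSLE metric can be shown without any ML library. This is a simplified standard-library sketch, not the project's sklearn pipeline: `fit_scaler`, `k_fold_indices`, `rmsle`, and the toy `areas` values are hypothetical stand-ins.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; assumes non-negative values."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance) or 1.0  # fall back to 1.0 for constant input


def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


# Toy feature (made up): house areas.
areas = [50.0, 60.0, 75.0, 80.0, 95.0, 110.0, 120.0, 150.0, 200.0, 400.0]

train_means = []
for train_idx, val_idx in k_fold_indices(len(areas), k=5):
    # Statistics come from the training fold only, so the validation rows
    # never influence the preprocessing they are later evaluated with.
    mean, std = fit_scaler([areas[i] for i in train_idx])
    val_scaled = [(areas[i] - mean) / std for i in val_idx]
    train_means.append(mean)

print(train_means)
```

Each fold learns slightly different scaling statistics. Fitting once on the full dataset before splitting would let validation rows shape those statistics, which is the leak the pipeline design avoids.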

Python · scikit-learn · XGBoost · MLflow · FastAPI · Docker · GitHub Actions · Render

⌥ GitHub ↗ Live API

05 / 05

ADVANCED · LLM · RAG · GenAI

RAG-Based Document Q&A System

End-to-end Retrieval-Augmented Generation pipeline. Upload any PDF or text document, ask questions in natural language, and get answers grounded strictly in the source content, with multi-turn conversation memory and a system prompt designed to prevent hallucination.

384-dim

FAISS vectors

2

LLM backends

Zero

Hallucinations

Multi-turn

Conversation

PyPDF2 → RecursiveCharacterTextSplitter (800-char/100-overlap) → all-MiniLM-L6-v2 embeddings → FAISS vector store for semantic retrieval
Dual LLM backends: Google Gemini 2.5 Flash + Groq Llama 3 via LangChain; custom anti-hallucination system prompt constrains answers to retrieved context only
ConversationBufferMemory for multi-turn Q&A; @st.cache_resource model caching eliminates per-request reload; deployed on Streamlit Cloud
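
The chunk-then-retrieve flow above can be approximated in a few lines. This sketch is a simplified stand-in, not the project's code: `chunk_text` uses a fixed sliding window rather than LangChain's recursive splitter, `top_k` does a brute-force cosine ranking in place of FAISS, and the 3-dimensional vectors are made-up placeholders for the 384-dimensional embeddings.

```python
import math


def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into fixed-size windows where neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunk vectors most similar to the query."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]


# Toy document: 2,000 characters of repeating digits.
text = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(text)

# Toy embeddings (made up) standing in for real sentence vectors.
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
best = top_k([1.0, 0.0, 0.0], chunk_vecs, k=2)
print(len(chunks), best)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.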

Python · LangChain · FAISS · HuggingFace · Gemini 2.5 Flash · Groq Llama 3 · Streamlit · PyPDF2

⌥ GitHub ↗ Live App

Beyond these 5 featured projects, I have built 20+ additional projects across SQL, Excel analytics, Power BI dashboards, Python automation, and ML pipelines. View GitHub ↗

SQL

Window Functions

CTEs & Subqueries

Business Analytics

Data Cleaning

Excel & BI

Power Query

Pivot Dashboards

Power BI (DAX)

Financial Models

Python Analytics

EDA Pipelines

Automation Scripts

Data Visualisation

Statistical Analysis

ML & MLOps

Classification & Regression

NLP Pipelines

AWS Deployments

Experiment Tracking
