Data Scientist

Sourabh
Sonker

Building production-grade ML, NLP, Time Series, and LLM systems — from raw data to deployed, measurable results.

// Domain interest: Finance · FinTech · Analytics-driven products

Machine Learning · NLP · Transformers · Time Series · LLM · RAG · MLOps · CI/CD · Data Analysis
86.9% AUC-ROC
8.8% MAPE (Retail)
4.3× Bearish F1 Gain
5 Live Deployments
// about

The Story

I graduated from Delhi Technological University in 2024 with a B.Tech in Mechanical Engineering (CGPA: 7.86). In my final year, I made a deliberate call to skip campus placements and pivot to Data Science — betting on skills over credentials.

Since then, I've built 5 production-deployed projects covering the full modern DS stack: classical ML with explainability, NLP with transformer models, time series forecasting, LLM-powered RAG systems, and an MLOps pipeline with Docker and CI/CD. Every project is live, measurable, and built to production standards.

My skills are domain-agnostic: the techniques behind bank churn prediction carry over to retention problems in EdTech, and financial forecasting methods apply to any sequential data problem. I have a particular interest in Finance and FinTech, but I bring full-stack DS capability to any data-intensive product.

Education
B.Tech Mech. Engg. · DTU · 2024
CGPA
7.81 / 10.0
Location
Delhi, India · Open to Remote · PAN India
Domain Interest
Finance · FinTech · Analytics
Skills Apply To
Any data-intensive domain
Availability
Immediately Available
// projects

Featured Projects

01 / 05
BEGINNER · ML · CLASSIFICATION

Bank Customer Churn Prediction & Explainability

Bank Customer Churn Predictor Interface

Full ML lifecycle on 10,000 bank customer records — imbalanced data handling, Bayesian hyperparameter tuning, and SHAP-powered per-customer explainability. Business framing throughout: quantifying the real cost of missing a churn event.

86.9% XGBoost AUC-ROC
+3.6 pts Optuna gain
4:1 Imbalance handled
5 Retention levers found
  • Benchmarked 3 models: XGBoost (86.9%) · RF (86.2%) · LR (77.2%) with stratified evaluation; Optuna Bayesian tuning (50 trials) added +3.6 pts over untuned baseline
  • Engineered 4 domain features — products_per_tenure ranked #2 in SHAP importance, capturing the non-linear 3–4 product churn cliff
  • SHAP TreeExplainer identified top 5 levers: age 41–60, Germany geography, inactive membership, mono-product high-balance holders, credit score <600
Python · XGBoost · scikit-learn · SMOTE · SHAP · Optuna · Streamlit
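The headline AUC-ROC figure can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below illustrates that pairwise definition in plain Python. The function name `auc_roc` and the four-customer sample are illustrative assumptions and are not taken from the project, which would more likely rely on a library routine such as scikit-learn's `roc_auc_score`.

```python
def auc_roc(labels, scores):
    """AUC-ROC via the pairwise (Mann-Whitney) definition.

    labels: 1 = churned, 0 = retained. scores: predicted churn probability.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four customers (illustrative only).
sample_labels = [0, 0, 1, 1]
sample_scores = [0.10, 0.40, 0.35, 0.80]
print(auc_roc(sample_labels, sample_scores))  # 0.75
```

Because the metric depends only on how customers are ranked, a model that always predicts the majority class scores 0.5 rather than looking accurate, which matters with a 4:1 imbalance.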
02 / 05
INTERMEDIATE · NLP · TRANSFORMERS

Financial News Sentiment Analysis + Stock Correlation

Financial News Sentiment Analyser Interface

Finance-domain NLP using FinBERT — a BERT model pre-trained on financial text. Goes beyond accuracy: ablation studies, a caught evaluation bug, and a statistically honest sentiment-price correlation study across 5 tickers and 2 years of data.

75.1% FinBERT accuracy
4.3× Bearish F1 gain
5,842 Headlines
<1 sec Tab load time
  • FinBERT zero-shot: 4.3× improvement in bearish detection (F1: 0.14 → 0.60) over TF-IDF+SVM; domain-aware preprocessing validated via ablation study
  • Identified and corrected a label-misalignment bug that inflated VADER baseline by 25.3 percentage points — caught through EDA notebook review
  • 4-tab Streamlit dashboard; ~1hr cold-start resolved by precomputing FinBERT artifacts; yfinance rate-limiting fixed via precomputed CSVs
Python · FinBERT · HuggingFace · PyTorch · scikit-learn · yfinance · Plotly · Streamlit
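The bearish F1 comparison rests on a per-class F1 score, the harmonic mean of precision and recall for a single label. The sketch below is a plain-Python illustration under stated assumptions: `f1_for_class` and the five sample labels are hypothetical, and only the 0.14 and 0.60 figures come from the project.

```python
def f1_for_class(y_true, y_pred, target):
    """F1 score for a single class, treating `target` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    if tp == 0:
        return 0.0  # no correct positives, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five headlines (illustrative only).
true_labels = ["bearish", "bearish", "neutral", "bullish", "bearish"]
pred_labels = ["bearish", "neutral", "neutral", "bullish", "bearish"]
bearish_f1 = f1_for_class(true_labels, pred_labels, "bearish")

# Relative gain reported in the project: baseline F1 0.14, FinBERT F1 0.60.
reported_gain = 0.60 / 0.14
print(round(bearish_f1, 2), round(reported_gain, 1))  # 0.8 4.3
```

Reporting F1 per class rather than overall accuracy keeps a weak minority class, here the bearish label, from being hidden by strong results on the other labels.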
03 / 05
INTERMEDIATE · TIME SERIES

Retail + NSE Stock Time Series Forecasting

NSE Stock Forecaster Interface

Dual-domain forecasting — 3M+ rows of retail data plus live NSE stock prices. Every modelling decision is documented and justified: why SARIMA beat Prophet here, what the earthquake spike means, why stock MAPE is higher than retail MAPE.

8.8% SARIMA MAPE
r = −0.47 Oil-sales corr.
3M+ Rows processed
10 NSE tickers live
  • SARIMA: 8.8% MAPE on 90-day retail horizon vs. Prophet 12.1%; Prophet's holiday-modelling advantage is absent in the holiday-free Jun–Aug 2017 test window (documented with reasoning)
  • Macro signal: WTI oil price r = −0.47 with Ecuador grocery sales; April 2016 earthquake modelled as named Prophet event (2× normal Saturday demand)
  • Extended to live NSE stocks; ARIMA(0,1,1) order consistent with weak-form EMH — documented as insight, not buried; Streamlit app supporting 10 tickers
Python · Statsmodels · pmdarima · Prophet · yfinance · Plotly · Streamlit
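MAPE, the metric quoted for both models, averages the absolute forecast error as a percentage of the actual value. The sketch below computes it against a flat last-value forecast, the usual random-walk benchmark for price series. The helper names `mape` and `flat_forecast` and the sales figures are illustrative assumptions, not the project's code or data.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, so those inputs are rejected.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def flat_forecast(history, horizon):
    """Random-walk style baseline: repeat the last observed value."""
    return [history[-1]] * horizon


# Hypothetical daily sales (illustrative only).
history = [100.0, 120.0, 110.0]
actual_next = [100.0, 125.0, 110.0]
baseline = flat_forecast(history, horizon=3)  # [110.0, 110.0, 110.0]
print(round(mape(actual_next, baseline), 2))  # 7.33
```

Scoring every model against the same simple baseline makes it easier to judge whether a lower MAPE reflects real structure in the series or just an easy test window.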
04 / 05
ADVANCED · MLOPS · CI/CD

House Price Prediction — End-to-End MLOps Pipeline

House Price Prediction API Documentation Interface

Production ML engineering from pipeline design to live deployment. A leak-proof sklearn pipeline, MLflow experiment tracking, FastAPI REST service, Docker container, and GitHub Actions CI/CD that auto-deploys on every push — with failing tests blocking deployment.

0.1209 XGBoost CV RMSLE
4 Models tracked
5-fold Cross-validation
Live API on Render
  • ColumnTransformer fit exclusively on training folds — prevents data leakage; MLflow tracking: XGBoost selected at CV RMSLE 0.1209 over LightGBM (0.1284), Ridge (0.1392)
  • FastAPI with Pydantic validation and Swagger UI; multi-stage Docker build; GitHub Actions CI/CD — tests → retrain → build → deploy; failing tests block deployment
  • Diagnosed and resolved 5 independent production issues including CI mock scoping and Docker image bloat
Python · scikit-learn · XGBoost · MLflow · FastAPI · Docker · GitHub Actions · Render
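The leak-prevention idea is that preprocessing statistics are learned from the training fold alone and then applied to the validation fold. The project does this with scikit-learn's ColumnTransformer; the sketch below shows the same principle in plain Python, alongside the RMSLE metric. All helper names (`rmsle`, `kfold_indices`, `fit_scaler`, `transform`) and the sample feature column are hypothetical stand-ins.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error for non-negative values."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance) or 1.0  # fall back to 1.0 if std is zero


def transform(values, mean, std):
    """Standardise values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical feature column (illustrative only).
area = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0, 140.0]

for train_idx, valid_idx in kfold_indices(len(area), n_folds=5):
    # Fit on the training fold only, then apply to both splits.
    mean, std = fit_scaler([area[i] for i in train_idx])
    train_scaled = transform([area[i] for i in train_idx], mean, std)
    valid_scaled = transform([area[i] for i in valid_idx], mean, std)
```

Fitting the scaler on all rows before splitting would let validation data influence the training features, which tends to make cross-validation scores look better than live performance.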
05 / 05
ADVANCED · LLM · RAG · GenAI

RAG-Based Document Q&A System

RAG Document Q&A App Interface

End-to-end Retrieval-Augmented Generation pipeline. Upload any PDF or text document, ask questions in natural language, and get answers constrained to the source content, with multi-turn conversation memory and a system prompt designed to prevent hallucination.

384-dim FAISS vectors
2 LLM backends
Zero Hallucinations
Multi-turn Conversation
  • PyPDF2 → RecursiveCharacterTextSplitter (800-char/100-overlap) → all-MiniLM-L6-v2 embeddings → FAISS vector store for semantic retrieval
  • Dual LLM backends: Google Gemini 2.5 Flash + Groq Llama 3 via LangChain; custom anti-hallucination system prompt constrains answers to retrieved context only
  • ConversationBufferMemory for multi-turn Q&A; @st.cache_resource model caching eliminates per-request reload; deployed on Streamlit Cloud
Python · LangChain · FAISS · HuggingFace · Gemini 2.5 Flash · Groq Llama 3 · Streamlit · PyPDF2
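The retrieval steps above, chunking with overlap and then ranking chunks by similarity to the question, can be sketched in plain Python. This is a simplified illustration under stated assumptions: `chunk_text` is a fixed-window splitter, simpler than LangChain's RecursiveCharacterTextSplitter (which prefers to break on separators), and cosine similarity over toy 3-dimensional vectors stands in for FAISS search over 384-dimensional embeddings. All function names here are hypothetical.

```python
import math


def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end
        start += step
    return chunks


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]


chunks = chunk_text("x" * 2000)
print([len(c) for c in chunks])  # [800, 800, 600]

# Toy embeddings (illustrative only); real ones would come from a model.
toy_vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(top_k([1.0, 0.0, 0.0], toy_vectors))  # [0, 2]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.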
Beyond these 5 featured projects, I have built 20+ additional projects across SQL, Excel analytics, Power BI dashboards, Python automation, and ML pipelines. View GitHub ↗
SQL
Window Functions · CTEs & Subqueries · Business Analytics · Data Cleaning
Excel & BI
Power Query · Pivot Dashboards · Power BI (DAX) · Financial Models
Python Analytics
EDA Pipelines · Automation Scripts · Data Visualisation · Statistical Analysis
ML & MLOps
Classification & Regression · NLP Pipelines · AWS Deployments · Experiment Tracking
// skills

Technical Stack

Languages & Data
Python · SQL · Pandas · NumPy · Excel · Plotly · Power BI · yfinance
Machine Learning
XGBoost · LightGBM · scikit-learn · SMOTE · SHAP · Optuna · Prophet · SARIMA · PyTorch
LLM / NLP
LangChain · FAISS · HuggingFace · FinBERT · Sentence-Transformers · Gemini 2.5 Flash · Groq Llama 3 · NLTK
MLOps & Deployment
FastAPI · Docker · GitHub Actions · MLflow · Render · Streamlit · Pydantic · CI/CD
// certifications

Learning & Credentials

Hugging Face
AI Agents Course
Verified Certificate
University of Amsterdam · Coursera
Basic Statistics
Verified Certificate
Macquarie University · Coursera
Excel Skills for Business: Essentials
Verified Certificate
Macquarie University · Coursera
Excel Skills for Business: Advanced
Verified Certificate
Kaggle · 11 Micro-Course Certificates
Python · Pandas · Data Visualisation · SQL Intro · SQL Advanced · Data Cleaning · Feature Engineering · Machine Learning · Deep Learning · ML Explainability · Time Series
All Verified
// contact

Let's Talk

Open to Data Scientist and Analyst roles at startups and growth-stage companies. Strong preference for data-intensive products — Finance, FinTech, EdTech, HealthTech, SaaS analytics. Available immediately.