Projects - Bhairus

Real-world data engineering projects spanning enterprise RAG/LLM pipelines, real-time streaming on Kafka + AWS, and containerized PySpark workflows.

Enterprise RAG System — Generative AI Capstone

Role: Data & AI Engineer · 2025

Designed an end-to-end Retrieval-Augmented Generation (RAG) pipeline for regulated enterprise use cases, covering document ingestion, embedding generation, vector indexing, semantic retrieval, and LLM response orchestration.

Architecture & Workflow

Document ingestion and chunking pipeline with embedding generation
Vector indexing and semantic retrieval for grounded LLM responses
Governance guardrails and deterministic escalation logic for compliance
Streamlit-based evaluation interface for prompt/response review
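
A minimal sketch of the chunk, embed, index, and retrieve steps listed above, offered as an illustration rather than the project's actual code. The token-count "embedding" is a stand-in assumption so the example runs without a model; a real pipeline would call an embedding model and a vector store. All function names and parameters here (chunk_text, embed, build_index, retrieve, chunk_size, overlap) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping windows of words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): lowercase token counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, chunk_size=40, overlap=10):
    """Return a list of (chunk, vector) pairs for every chunk of every document."""
    return [
        (chunk, embed(chunk))
        for doc in documents
        for chunk in chunk_text(doc, chunk_size, overlap)
    ]


def retrieve(query, index, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The retrieved chunks would then be placed in the LLM prompt as grounding context. Overlapping windows are one common way to avoid splitting a relevant sentence across chunk boundaries.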

What This Demonstrates

End-to-end LLM pipeline design with retrieval grounding
Awareness of governance and risk controls in regulated environments
Practical evaluation tooling for non-deterministic systems
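
The deterministic escalation logic mentioned under Architecture & Workflow could be sketched as a fixed rule check that runs before any LLM call. The patterns, the score threshold, and the route function below are hypothetical placeholders; real rules would come from the relevant compliance policy.

```python
import re

# Hypothetical rule set (assumption): real patterns would be policy-driven.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-like number in the query
    re.compile(r"\blegal advice\b", re.IGNORECASE),  # topic reserved for human review
]

# Hypothetical threshold (assumption): below this retrieval score the
# answer is treated as insufficiently grounded.
MIN_RETRIEVAL_SCORE = 0.35


def route(query, top_score):
    """Return 'escalate' or 'answer' using fixed, auditable rules only."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(query):
            return "escalate"
    if top_score < MIN_RETRIEVAL_SCORE:
        return "escalate"
    return "answer"
```

Because the decision depends only on the query text and the retrieval score, the same input always produces the same routing outcome, which keeps the escalation path reviewable even though the model's responses are non-deterministic.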

Tech Stack: Python, LLMs, Vector Embeddings, Semantic Search, Streamlit

View on GitHub

Real-Time Streaming Data Pipeline — Kafka + AWS

Role: Data Engineer · 2024

Built a streaming ingestion pipeline using Apache Kafka (producer/consumer architecture) for near-real-time event processing, with an S3 data lake, AWS Glue catalog, and SQL analytics via Amazon Athena.

Architecture & Workflow

Real-time data ingestion using Kafka producers and consumers built in Python
Kafka infrastructure hosted on AWS EC2
Streaming data persisted in Amazon S3 as a scalable data lake
Schema discovery and management using AWS Glue Crawler & Data Catalog
Analytical querying performed with Amazon Athena using SQL
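
One way the consumer side could lay out objects in S3 so that a Glue Crawler and Athena can work with them is sketched below. It assumes Hive-style partition folders (year=/month=/day=) and newline-delimited JSON. The prefix, topic name, and helper names (s3_key_for, to_json_lines) are hypothetical, and the Kafka consumer loop and S3 upload are deliberately left out.

```python
import json
from datetime import datetime, timezone


def s3_key_for(event_time, topic, batch_id, prefix="raw"):
    """Build a Hive-style partitioned object key for one batch of events.

    event_time should be timezone-aware; it is normalized to UTC so that
    partitions do not depend on the consumer host's local time zone.
    """
    ts = event_time.astimezone(timezone.utc)
    return (
        f"{prefix}/{topic}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"batch-{batch_id:06d}.jsonl"
    )


def to_json_lines(events):
    """Serialize events as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(event, sort_keys=True) for event in events) + "\n"


# Usage: build the key and body that a consumer would upload for one batch.
batch = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 20.0}]
key = s3_key_for(datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc), "orders", 7)
body = to_json_lines(batch)
```

Partitioning by date lets Athena queries that filter on year, month, or day scan only the matching folders instead of the whole data set.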

What This Demonstrates

Hands-on experience with real-time streaming systems
Cloud-native data lake architecture design
Practical understanding of ingestion, storage, and query layers
Ability to design scalable, decoupled data pipelines

Tech Stack: Apache Kafka, Python, AWS (EC2, S3, Glue, Athena), SQL

View on GitHub

Containerized PySpark Development Environment — Docker

Role: Data Engineer · 2026

Built a production-grade local PySpark environment using Docker, resolving Python version mismatches and container dependency-ordering issues encountered along the way.

Architecture & Workflow

Multi-container Docker setup for PySpark + Jupyter + supporting services
Pinned Python and Spark versions to resolve runtime compatibility conflicts
Reproducible local environment for PySpark development and testing
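
PySpark refuses to run when the driver and worker Python interpreters differ in minor version, which is the kind of mismatch that pinning avoids. A small preflight check along the following lines could catch the problem before a job starts. The function names are hypothetical, and how the worker version is obtained (for example, from the worker image's build arguments) is an assumption left outside the sketch.

```python
import sys


def minor_version(version):
    """Return (major, minor) from a version string such as '3.11.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_python_match(driver_version, worker_version):
    """Raise if driver and worker differ in major or minor Python version."""
    if minor_version(driver_version) != minor_version(worker_version):
        raise RuntimeError(
            f"Python mismatch: driver {driver_version} vs worker {worker_version}; "
            "pin both images to the same minor version."
        )
    return True


def current_python():
    """Version string of the interpreter running this script."""
    info = sys.version_info
    return f"{info.major}.{info.minor}.{info.micro}"
```

Patch-level differences (3.11.4 versus 3.11.9) pass the check, since only the major and minor components are compared.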

What This Demonstrates

Practical containerization of distributed data tooling
Debugging runtime/version conflicts in Spark ecosystems
Reproducible engineering workflows that mirror production setups

Tech Stack: Docker, PySpark, Python, Jupyter, Shell

View on GitHub

More projects coming soon...