💼 My Projects

Real-world data engineering projects spanning enterprise RAG/LLM pipelines, real-time streaming on Kafka + AWS, and containerized PySpark workflows.

Enterprise RAG System — Generative AI Capstone

Role: Data & AI Engineer · 2025

Designed an end-to-end Retrieval-Augmented Generation (RAG) pipeline for regulated enterprise use cases, covering document ingestion, embedding generation, vector indexing, semantic retrieval, and LLM response orchestration.

Architecture & Workflow

  • Document ingestion and chunking pipeline with embedding generation
  • Vector indexing and semantic retrieval for grounded LLM responses
  • Governance guardrails and deterministic escalation logic for compliance
  • Streamlit-based evaluation interface for prompt/response review
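The retrieval and escalation steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and every name here (`chunk_text`, `retrieve`, `answer_or_escalate`, the `min_score` threshold) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (assumes overlap < max_words)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, highest score first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def answer_or_escalate(query, chunks, min_score=0.2):
    """Deterministic guardrail: escalate when no chunk is similar enough."""
    hits = retrieve(query, chunks)
    if not hits or hits[0][0] < min_score:
        return {"action": "escalate", "context": []}
    context = [chunk for score, chunk in hits if score >= min_score]
    return {"action": "answer", "context": context}
```

Because the escalation rule is a fixed threshold on the retrieval score rather than a model decision, the same query over the same index always takes the same path, which is one plausible reading of "deterministic escalation logic" in a compliance setting.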

What This Demonstrates

  • End-to-end LLM pipeline design with retrieval grounding
  • Awareness of governance and risk controls in regulated environments
  • Practical evaluation tooling for non-deterministic systems

Tech Stack: Python, LLMs, Vector Embeddings, Semantic Search, Streamlit

Real-Time Streaming Data Pipeline — Kafka + AWS

Role: Data Engineer · 2024

Built a streaming ingestion pipeline using Apache Kafka (producer/consumer architecture) for near-real-time event processing, with an S3 data lake, AWS Glue catalog, and SQL analytics via Amazon Athena.

Architecture & Workflow

  • Real-time data ingestion using Kafka producers and consumers built in Python
  • Kafka infrastructure hosted on AWS EC2
  • Streaming data persisted in Amazon S3 as a scalable data lake
  • Schema discovery and management using AWS Glue Crawler & Data Catalog
  • Analytical querying performed with Amazon Athena using SQL
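One way the consumer side of the workflow above might land events in S3 is sketched below. It leaves out the Kafka client and the S3 upload and shows only the batching and key layout. The function names, the JSON Lines format, and the `dt=`/`hour=` partition scheme are assumptions for illustration, not details confirmed by the project.

```python
import json
from datetime import datetime, timezone


def batch_events(events, batch_size):
    """Group an iterable of consumed events into lists of at most batch_size."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


def to_json_lines(events):
    """Serialize events as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events) + "\n"


def s3_key_for(topic, event_time, batch_id):
    """Build a Hive-style partitioned object key (dt=YYYY-MM-DD/hour=HH).

    Assumes event_time is timezone-aware; it is normalized to UTC so that
    partitions do not depend on the consumer host's local time zone.
    """
    ts = event_time.astimezone(timezone.utc)
    return f"{topic}/dt={ts:%Y-%m-%d}/hour={ts:%H}/batch-{batch_id:06d}.jsonl"
```

Keys in the `name=value` form are the layout a Glue crawler can register as table partitions, which in turn lets Athena queries filter on `dt` or `hour` instead of scanning the whole bucket.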

What This Demonstrates

  • Hands-on experience with real-time streaming systems
  • Cloud-native data lake architecture design
  • Practical understanding of ingestion, storage, and query layers
  • Ability to design scalable, decoupled data pipelines

Tech Stack: Apache Kafka, Python, AWS (EC2, S3, Glue, Athena), SQL

Containerized PySpark Development Environment — Docker

Role: Data Engineer · 2026

Built a local PySpark environment with Docker that mirrors production setups, resolving Python version mismatches and container dependency-ordering issues along the way. The project demonstrates hands-on containerization for data engineering workflows.

Architecture & Workflow

  • Multi-container Docker setup for PySpark + Jupyter + supporting services
  • Pinned Python and Spark versions resolving runtime compatibility conflicts
  • Reproducible local environment for PySpark development and testing
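The version pinning above matters because PySpark generally refuses to run when the driver and worker interpreters differ at the major.minor level. A small pre-flight check along these lines could catch that early. This is a sketch under assumptions: the helper names are hypothetical and the project's actual check, if any, is not shown in this README.

```python
import sys


def minor_version(version):
    """Return (major, minor) from a version string such as '3.10.12'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_python_versions(driver, worker):
    """Return None if major.minor match, otherwise a readable error message."""
    if minor_version(driver) == minor_version(worker):
        return None
    d = ".".join(map(str, minor_version(driver)))
    w = ".".join(map(str, minor_version(worker)))
    return f"Python mismatch: driver {d} vs worker {w}"


def current_python():
    """Version string of the interpreter running this script."""
    info = sys.version_info
    return f"{info.major}.{info.minor}.{info.micro}"
```

Run inside each container before the Spark services start, a check like this turns a confusing runtime failure into an immediate, explicit message. Patch-level differences (for example 3.10.4 vs 3.10.12) are deliberately ignored here.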

What This Demonstrates

  • Practical containerization of distributed data tooling
  • Debugging runtime/version conflicts in Spark ecosystems
  • Reproducible engineering workflows that mirror production setups

Tech Stack: Docker, PySpark, Python, Jupyter, Shell

More projects coming soon...