Real-world data engineering projects spanning enterprise RAG/LLM pipelines, real-time streaming on Kafka + AWS, and containerized PySpark workflows.
Designed an end-to-end Retrieval-Augmented Generation (RAG) pipeline for regulated enterprise use cases, covering document ingestion, embedding generation, vector indexing, semantic retrieval, and LLM response orchestration.
Architecture & Workflow
- Document ingestion and chunking pipeline with embedding generation
- Vector indexing and semantic retrieval for grounded LLM responses
- Governance guardrails and deterministic escalation logic for compliance
- Streamlit-based evaluation interface for prompt/response review
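The retrieval and escalation steps above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the term-frequency `embed` function stands in for a real embedding model, the in-memory list stands in for a vector index, and every name here (`chunk_text`, `build_index`, `answer_or_escalate`, the `min_score` threshold) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): lowercase term counts instead of a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Chunk every document and store (chunk, vector) pairs in a list."""
    return [(c, embed(c)) for doc in documents for c in chunk_text(doc)]


def retrieve(query, index, k=2):
    """Return the k highest-scoring (score, chunk) pairs for the query."""
    q = embed(query)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def answer_or_escalate(query, index, min_score=0.2):
    """Deterministic guardrail: escalate when retrieval is too weak to ground an answer."""
    hits = retrieve(query, index)
    if not hits or hits[0][0] < min_score:
        return {"action": "escalate", "context": []}
    return {"action": "answer", "context": [chunk for _, chunk in hits]}


if __name__ == "__main__":
    docs = [
        "Refunds are processed within five business days of approval.",
        "Passwords must be rotated every ninety days.",
    ]
    index = build_index(docs)
    print(answer_or_escalate("refunds processed days", index))
    print(answer_or_escalate("quarterly revenue forecast", index))
```

Because the escalation rule is a fixed threshold on the retrieval score rather than a model decision, the same query against the same index always takes the same path, which is the property a compliance review would typically want from this step.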
What This Demonstrates
- End-to-end LLM pipeline design with retrieval grounding
- Awareness of governance and risk controls in regulated environments
- Practical evaluation tooling for non-deterministic systems
Tech Stack: Python, LLMs, Vector Embeddings, Semantic Search, Streamlit
Built a streaming ingestion pipeline using Apache Kafka (producer/consumer architecture) for near-real-time event processing, with an S3 data lake, AWS Glue catalog, and SQL analytics via Amazon Athena.
Architecture & Workflow
- Real-time data ingestion using Kafka producers and consumers built in Python
- Kafka infrastructure hosted on AWS EC2
- Streaming data persisted in Amazon S3 as a scalable data lake
- Schema discovery and management using AWS Glue Crawler & Data Catalog
- Analytical querying performed with Amazon Athena using SQL
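One way the consumer-to-S3 step could be organized is sketched below, assuming events carry an epoch-seconds `ts` field and are written as newline-delimited JSON under Hive-style `dt=/hour=` prefixes, a layout that Glue crawlers can register as partitions and Athena can prune on. The object layout, the field name, and the helper names are assumptions, not details from the project. The Kafka consumer loop and the actual S3 upload are left out so the sketch runs without a broker or AWS credentials.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event, prefix="events"):
    """Build a Hive-style key prefix from the event's epoch timestamp (UTC)."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def batch_events(events, prefix="events"):
    """Group events by partition and serialize each group as NDJSON text."""
    groups = defaultdict(list)
    for event in events:
        line = json.dumps(event, sort_keys=True)
        groups[partition_key(event, prefix)].append(line)
    # One payload per partition; a real pipeline would upload each payload
    # as its own object under the matching prefix.
    return {key: "\n".join(lines) + "\n" for key, lines in groups.items()}


if __name__ == "__main__":
    consumed = [
        {"ts": 0, "user": "a"},
        {"ts": 10, "user": "b"},
        {"ts": 3600, "user": "c"},
    ]
    for key, payload in batch_events(consumed).items():
        print(key)
        print(payload, end="")
```

Batching before writing keeps the number of small objects in the data lake down, which tends to matter for Athena query cost and latency.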
What This Demonstrates
- Hands-on experience with real-time streaming systems
- Cloud-native data lake architecture design
- Practical understanding of ingestion, storage, and query layers
- Ability to design scalable, decoupled data pipelines
Tech Stack: Apache Kafka, Python, AWS (EC2, S3, Glue, Athena), SQL
Built a production-like local PySpark environment using Docker, resolving Python version mismatches and container dependency ordering issues encountered along the way. The project demonstrates hands-on containerization experience for data engineering workflows.
Architecture & Workflow
- Multi-container Docker setup for PySpark + Jupyter + supporting services
- Pinned Python and Spark versions to resolve runtime compatibility conflicts
- Reproducible local environment for PySpark development and testing
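The two problem classes named above can be illustrated with a small sketch. PySpark refuses to run when the driver and worker Python interpreters differ in minor version (the interpreters are selected through the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables), and a container that has started is not necessarily ready to accept connections. The helper names below are hypothetical and are not taken from the project.

```python
import socket
import time


def minor_version(version):
    """Return (major, minor) from a version string such as '3.10.12'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(driver_version, worker_version):
    """True when driver and worker share the same major.minor Python version."""
    return minor_version(driver_version) == minor_version(worker_version)


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP port accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # service not ready yet; retry
    return False


if __name__ == "__main__":
    print(versions_compatible("3.10.12", "3.10.4"))
    print(versions_compatible("3.10.12", "3.9.18"))
```

A check like `wait_for_port` would typically run in a container entrypoint before the dependent service starts, as a complement to pinning the same Python version in every image.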
What This Demonstrates
- Practical containerization of distributed data tooling
- Debugging runtime/version conflicts in Spark ecosystems
- Reproducible engineering workflows that mirror production setups
Tech Stack: Docker, PySpark, Python, Jupyter, Shell
More projects coming soon...