Open to new opportunities · Austin, TX

Dharmic Reddy Meka

AI Data Engineer · Sr Data Analyst @ UT Austin

I build data pipelines, lakehouses, and AI retrieval systems that power analytics at scale. Currently shipping BI infrastructure on Microsoft Fabric for a $2B+ capital projects portfolio at UT Austin. Previously at Accenture (UBS) and Oklahoma State University.


about

I'm a data engineer focused on the infrastructure that makes analytics and AI possible. At UT Austin, I architect governed lakehouses on Microsoft Fabric for a $2B+ capital projects portfolio and ship the Azure Data Factory pipelines that feed them.

Before this, I built Databricks + PySpark ETL at Accenture (UBS) that serves 200+ business users, and engineered an end-to-end NLP pipeline at Oklahoma State that processed 50K+ news articles for election research.

Lately I've been going deep on vector search and retrieval systems. My latest side project is a production-shaped RAG pipeline over arXiv papers using Postgres + pgvector. Targeting AI Data Engineer and Data Engineer roles next.

Location: Austin, Texas
Current: Sr Data Analyst · UT Austin
Focus: Data Engineering · AI retrieval
Experience: 3+ years production systems
Status: Open to opportunities
99.5% · Pipeline reliability across production Databricks + ADF workflows
87% · ETL runtime cut from 6+ hrs to 45 min via ADF + PySpark orchestration
$2B+ · Capital portfolio served by a governed Fabric lakehouse
50K+ · Articles processed through an end-to-end Python + NLP pipeline

work

Mar 2025 · Now
Current

Sr Data Analyst

The University of Texas at Austin
Planning, Design & Construction · Austin, TX
Microsoft Fabric · Azure Data Factory · Dataflows Gen2 · Direct Lake · Star Schema · DAX · Power Automate
  • Architected a governed lakehouse semantic layer on Microsoft Fabric with Direct Lake and star-schema restructuring for a $2B+ capital projects portfolio. Cut query refresh time by 60%+ and eliminated cross-workspace inconsistencies across 5 PDC teams.
  • Built automated ingestion pipelines on Azure Data Factory and Fabric Dataflows Gen2 consolidating financial data from 4+ source systems into a single governed reporting layer (consolidation pattern sketched below). Saved 3 hours per reporting cycle of manual data prep.
  • Designed dimensional models and optimized DAX using SUMMARIZECOLUMNS patterns and calculation groups. Cut ad-hoc data requests from 15 to 7 per month (53% reduction) and accelerated budget reviews by 3 days.
  • Migrated 15+ legacy Tableau workloads onto the governed Power BI semantic layer, standardizing KPI logic and fiscal hierarchies across teams.
  • Orchestrated Power Automate workflows for dataset refresh alerts, multi-stage approval routing, and exception flagging. Eliminated 5+ hours per week of manual follow-up.
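
The production pipelines in this role run in Azure Data Factory and Dataflows Gen2, so the snippet below is only an illustration of the multi-source consolidation pattern, written as the kind of PySpark that runs in a Fabric notebook. The source systems, table names, and columns are hypothetical.

```python
# Illustration only: the real pipelines are ADF + Dataflows Gen2.
# Assumes a Fabric/Spark notebook; table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

SOURCES = {  # hypothetical source systems feeding the reporting layer
    "erp": "raw_finance.erp_commitments",
    "pm_tool": "raw_finance.project_budgets",
}

conformed = None
for system, table in SOURCES.items():
    df = (
        spark.table(table)
        .select("project_id", "fiscal_year", "amount")
        .withColumn("source_system", F.lit(system))
    )
    conformed = df if conformed is None else conformed.unionByName(df)

# Land one governed table that every downstream semantic model reads from.
conformed.write.mode("overwrite").saveAsTable("gold.fact_capital_spend")
```
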
Aug 2024 · Mar 2025

Data Analyst · NLP Research

Oklahoma State University
Stillwater, OK
Python · BeautifulSoup · Scrapy · NLTK · NLP Pipelines · PostgreSQL · Data Quality
  • Built an end-to-end Python data pipeline (BeautifulSoup, Scrapy) that ingested and structured 50,000+ U.S. news articles into PostgreSQL with automated data-quality checks. This formed the foundation layer for all downstream NLP work.
  • Designed an NLP sentiment scoring pipeline using tokenization, normalization, and lemmatization to extract candidate-level media polarity across the 2024 U.S. election cycle (sketched below). Surfaced measurable bias patterns across 10+ news sources.
  • Delivered the analytics layer on top of this pipeline so the research team could identify statistically significant coverage disparities used in published findings.
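
A minimal sketch of the candidate-level scoring step described above, assuming NLTK with the punkt, wordnet, and vader_lexicon resources available. VADER stands in for the project's scoring approach, the candidate name and sample text are placeholders, and the real pipeline reads articles from PostgreSQL.

```python
# Sketch only: VADER is a stand-in scorer; real input comes from PostgreSQL.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer

for pkg in ("punkt", "punkt_tab", "wordnet", "vader_lexicon"):
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()

def candidate_polarity(text: str, candidate: str) -> float | None:
    """Average compound sentiment of the sentences that mention the candidate."""
    scores = []
    for sent in nltk.sent_tokenize(text):
        # normalization + lemmatization before matching the candidate token
        tokens = [lemmatizer.lemmatize(t.lower()) for t in word_tokenize(sent)]
        if candidate.lower() in tokens:
            scores.append(sia.polarity_scores(sent)["compound"])
    return sum(scores) / len(scores) if scores else None

print(candidate_polarity("Polls show Smith gaining ground after the debate.", "Smith"))
```
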
Aug 2021 · Dec 2022

Analytics Engineer

Accenture · Financial Services
Client: UBS · Hyderabad, India
Databricks · PySpark · Azure Data Factory · REST APIs · Star Schema · SQL · Data Marts
  • Built and maintained Databricks ETL pipelines in PySpark that ingested financial data from REST APIs, relational databases, and flat files at scale, with automated validation checks and structured error handling (pattern sketched below).
  • Orchestrated production workflows via Azure Data Factory. Reduced average pipeline processing time from 6+ hours to under 45 minutes with 99.5% reliability across reporting cycles.
  • Designed star-schema dimensional models and governed data marts supporting P&L tracking, KPI reporting, and executive analytics. Created a single source of truth consumed by 200+ business users across multiple business units.
  • Translated complex financial requirements into a governed analytics layer adopted by senior leadership for monthly performance reviews across UBS business units.
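
A hedged PySpark sketch of the validate-then-load pattern from this role, not UBS code: the landing path, schema, thresholds, and target table are illustrative.

```python
# Illustrative validate-then-load step for a Databricks ETL job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", True).csv("/mnt/landing/positions/")

# Automated checks: required keys present, no duplicate business keys.
checks = {
    "null_account_id": raw.filter(F.col("account_id").isNull()).count(),
    "duplicate_keys": raw.count()
        - raw.dropDuplicates(["account_id", "as_of_date"]).count(),
}
failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # Fail loudly so the ADF orchestrator can alert and retry.
    raise ValueError(f"Data quality checks failed: {failed}")

# Conform types and append into a star-schema fact table.
fact = (
    raw.withColumn("as_of_date", F.to_date("as_of_date"))
       .withColumn("market_value", F.col("market_value").cast("double"))
)
fact.write.format("delta").mode("append").saveAsTable("gold.fact_positions")
```
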

projects

Data Engineering · Sports ML

IPL Winner Prediction

End-to-end data engineering pipeline for IPL match prediction, built on licensed open data and official APIs only (no Terms-of-Service violations). Ingests historical match data from Cricsheet (ODbL), venue metadata from Wikipedia (CC BY-SA), and upcoming fixtures from CricketData.org; flows through a dbt warehouse with a star schema; feeds an XGBoost classifier with probability calibration; and serves predictions through a deployed Streamlit dashboard.

What I built
  • dbt warehouse with bronze, silver, and gold layers. Star schema across fact_matches, fact_ball_by_ball, dim_teams, dim_venues, dim_players, plus a team_canonical seed for entity resolution.
  • SCD Type 2 snapshot tracking team rebrandings (e.g. RCB Bangalore to Bengaluru) so historical matches stay tied to the right entity.
  • Strict walk-forward modeling split (train: 2022, val: early 2023, holdout: late 2023 + 2024). XGBoost beats the baseline by 9.8pp accuracy on the 102-match holdout (split and calibration sketched below).
  • Probability calibration with reliability diagram, Brier score, and ECE reported honestly in a published model card. MLflow tracks every run.
  • Dual-runtime orchestration: same Python entrypoints run inside both GitHub Actions (weekly cron, ephemeral Postgres service container) and a local Airflow DAG.
  • Live Streamlit dashboard with three pages (Predict, Calibration, Data). Reads from a SQLite snapshot bundled in the repo so the deploy is free-tier.
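
A minimal sketch of the walk-forward split and calibration step referenced above, assuming a feature table with a match_date column. The feature names are placeholders and the cut-off dates only approximate the season boundaries used in the repo.

```python
# Sketch: temporal split, XGBoost fit, probability calibration, holdout metrics.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, brier_score_loss

df = pd.read_parquet("gold_match_features.parquet").sort_values("match_date")

train = df[df["match_date"] < "2023-01-01"]
val = df[(df["match_date"] >= "2023-01-01") & (df["match_date"] < "2023-07-01")]
hold = df[df["match_date"] >= "2023-07-01"]   # untouched until final evaluation

features = ["team_elo_diff", "venue_win_rate", "won_toss"]   # placeholder features
model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(train[features], train["team1_won"])

# Calibrate predicted probabilities on the validation window only.
calibrated = CalibratedClassifierCV(model, method="isotonic", cv="prefit")
calibrated.fit(val[features], val["team1_won"])

proba = calibrated.predict_proba(hold[features])[:, 1]
print("holdout accuracy:", accuracy_score(hold["team1_won"], proba >= 0.5))
print("Brier score:", brier_score_loss(hold["team1_won"], proba))
```
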
Tech stack
Language: Python
Warehouse: PostgreSQL · dbt Core
ML: scikit-learn · XGBoost · calibration · MLflow
Orchestration: GitHub Actions (prod) · Airflow (local demo)
Dashboard: Streamlit Community Cloud · SQLite snapshot
Ingestion: Python · httpx · bulk download + REST APIs
Data Sources: Cricsheet (ODbL) · Wikipedia (CC BY-SA) · CricketData.org
Patterns: Star Schema · SCD Type 2 · Walk-forward CV
Infra: Docker · docker-compose · ephemeral PG service
XGBoost +9.8pp · 218 matches · Live Streamlit demo
View on GitHub
AI Data Engineering · Vector Search

arXiv RAG Pipeline with pgvector

Repo

A production-shaped RAG retrieval layer over recent arXiv ML papers. Not a notebook demo. Pulls papers via the arXiv API, chunks titles and abstracts with a sliding window, embeds each chunk on CPU with sentence-transformers, stores everything in Postgres + pgvector with an HNSW index, and serves semantic search behind a typed FastAPI endpoint with category filtering.
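
A sketch of the chunk-and-embed step this paragraph describes, assuming sentence-transformers is installed. The window sizes and the sample paper are illustrative rather than the repo's exact settings.

```python
# Sketch: sliding-window chunking over title + abstract, CPU embedding.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim, CPU-friendly

def sliding_chunks(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Word-level sliding window over the concatenated title and abstract."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

paper = {"title": "Attention Is All You Need", "abstract": "The dominant sequence ..."}
chunks = sliding_chunks(f'{paper["title"]}. {paper["abstract"]}')

# normalize_embeddings=True yields unit-length vectors, so cosine distance in
# pgvector agrees with the HNSW vector_cosine_ops index described below.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)   # (n_chunks, 384)
```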

What I built
  • One Postgres instance holds both metadata and vectors: the arXiv category filter and cosine ranking happen in a single SQL query, with no separate vector DB service (query sketched below).
  • HNSW cosine index aligned with vector_cosine_ops on L2-normalized embeddings so the operator and index agree for correct ANN ranking.
  • psycopg3 connection pool with the pgvector type adapter registered at connect time, so numpy.ndarray embeddings map to the Postgres vector type without extra conversion on ingest and search.
  • Resilient arXiv ingestion: httpx + feedparser + tenacity retries, 3-second rate limiting, paginated fetch with idempotent delete-then-insert per paper.
  • GitHub Actions CI spins up a real pgvector/pgvector:pg16 service container to verify migrations and ANN search end to end.
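
A hedged sketch of the single-query search path called out above. The chunks/papers schema, column names, array-typed categories column, and connection string are illustrative, not the repo's exact DDL.

```python
# Sketch: category filter + cosine ranking in one SQL statement via psycopg3.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("postgresql://localhost/arxiv") as conn:
    register_vector(conn)   # numpy arrays <-> the Postgres vector type

    query_vec = np.random.rand(384).astype(np.float32)   # stand-in for an embedded query

    rows = conn.execute(
        """
        SELECT p.arxiv_id, p.title, 1 - (c.embedding <=> %s) AS cosine_similarity
        FROM chunks c
        JOIN papers p ON p.id = c.paper_id
        WHERE %s = ANY(p.categories)      -- category filter in the same statement
        ORDER BY c.embedding <=> %s       -- pgvector cosine-distance operator
        LIMIT 10
        """,
        (query_vec, "cs.LG", query_vec),
    ).fetchall()

    for arxiv_id, title, score in rows:
        print(f"{score:.3f}  {arxiv_id}  {title}")
```
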
Tech stack
Language: Python 3.11
API: FastAPI · Pydantic v2
Vector DB: Postgres 16 · pgvector · HNSW (cosine)
Embeddings: sentence-transformers · all-MiniLM-L6-v2 (384-dim)
DB Driver: psycopg3 · connection pool · pgvector adapter
Ingestion: httpx · feedparser · tenacity retries
Config: pydantic-settings · structlog
Infra: Docker · docker-compose · GitHub Actions CI
Testing: pytest · mocked API · pgvector integration
pgvector HNSW · 384-dim embeddings · FastAPI /search
View on GitHub
Data Engineering · Healthcare

Multi-Source Healthcare Claims Lakehouse

Repo

End-to-end Databricks lakehouse that ingests healthcare claims from three source formats: structured CSV billing, nested JSON provider records, and unstructured PDF clinical notes. Flows through a Medallion (Bronze, Silver, Gold) architecture on Delta Lake into a star schema, with a 12-chart analytics dashboard on top.

What I built
  • Regex-based NLP extraction from clinical notes: vitals, diagnoses, medications, follow-up windows.
  • SCD Type 2 on the provider dimension for point-in-time queries on specialty and network status changes.
  • Custom data quality framework simulating Delta Live Tables expectations (warn / drop / fail), logged to a DQ table for monitoring (pattern sketched below).
  • Star schema with fact_claims and 5 dimensions, referential integrity validated, powering denial-rate and network-comparison analytics.
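
A minimal sketch of the warn / drop / fail expectation pattern referenced above, assuming a Databricks notebook where spark is predefined. Rule names, columns, and the DQ log table are illustrative.

```python
# Sketch: DLT-style expectations applied manually, with outcomes logged to a DQ table.
from pyspark.sql import DataFrame, functions as F

EXPECTATIONS = [
    # (rule name, condition, action) where action is "warn", "drop", or "fail"
    ("claim_amount_positive", F.col("claim_amount") > 0, "drop"),
    ("member_id_present", F.col("member_id").isNotNull(), "fail"),
    ("icd_code_format", F.col("icd10_code").rlike(r"^[A-Z]\d{2}"), "warn"),
]

def apply_expectations(df: DataFrame, table: str) -> DataFrame:
    for name, cond, action in EXPECTATIONS:
        failed = df.filter(~cond).count()
        # Log every rule outcome to a monitoring table, pass or fail.
        spark.createDataFrame(
            [(table, name, action, failed)],
            "table_name string, rule string, action string, failed_rows long",
        ).write.mode("append").saveAsTable("monitoring.dq_results")

        if failed and action == "fail":
            raise ValueError(f"{table}: expectation '{name}' failed for {failed} rows")
        if action == "drop":
            df = df.filter(cond)
    return df

silver_claims = apply_expectations(spark.table("bronze.claims"), "bronze.claims")
```
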
Tech stack
Language: PySpark · Spark SQL · Python
Platform: Databricks · Unity Catalog
Storage: Delta Lake · Managed Delta Tables
Patterns: Medallion · Star Schema · SCD Type 2
DQ: Custom DQ framework · DLT-style expectations
Viz: Databricks Dashboards · 12 chart types
3 source formats · 15 Delta tables · 12 visualizations
View on GitHub
Data Engineering · Retail

Retail Sales Analytics Pipeline

Repo

Production-style ETL pipeline on Databricks processing 541,909 real UK e-commerce transactions from the Kaggle Online Retail II dataset. Runs from raw ingestion through cleaned Silver to four aggregated Gold tables, feeding a scheduled AI/BI dashboard that refreshes daily at 06:00 UTC.

What I built
  • Bronze to Silver to Gold layering with Delta Lake: cancellation removal, dedup, type casting, derived revenue columns.
  • RFM customer segmentation surfacing 4,346 customers tiered into High / Mid / Low value (sketched below).
  • Scheduled Workflow orchestrating bronze to silver to gold as sequential dependent tasks, daily at 06:00 UTC.
  • 4 Gold tables powering the dashboard: monthly revenue by country, top 3,896 products, RFM tiers, day-of-week sales.
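
A sketch of the RFM tiering step referenced above, assuming a Databricks notebook where spark is predefined; the Silver table and column names are illustrative.

```python
# Sketch: recency / frequency / monetary aggregation and simple value tiers.
from pyspark.sql import functions as F, Window

silver = spark.table("silver.transactions")
snapshot = silver.agg(F.max("invoice_date")).first()[0]

rfm = silver.groupBy("customer_id").agg(
    F.datediff(F.lit(snapshot), F.max("invoice_date")).alias("recency_days"),
    F.countDistinct("invoice_no").alias("frequency"),
    F.sum("revenue").alias("monetary"),
)

# Rank customers into thirds on monetary value and map to High / Mid / Low tiers.
tier = F.ntile(3).over(Window.orderBy(F.desc("monetary")))
rfm_tiered = rfm.withColumn(
    "value_tier",
    F.when(tier == 1, "High").when(tier == 2, "Mid").otherwise("Low"),
)
rfm_tiered.write.format("delta").mode("overwrite").saveAsTable("gold.customer_rfm")
```
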
Tech stack
Language: PySpark · Python
Platform: Databricks · Apache Spark
Storage: Delta Lake
Orchestration: Databricks Workflows · Scheduled Jobs
Viz: Databricks AI/BI Dashboards
Source: Kaggle API · Online Retail II (2010-2011)
Pattern: Medallion · RFM Segmentation
541K transactions · 4 Gold tables · Daily scheduled job
View on GitHub

skills

01 Data Engineering & Pipelines

Microsoft Fabric · Databricks · Azure Data Factory · PySpark · dbt · Apache Airflow · Apache Spark · Dataflows Gen2 · Databricks Workflows · Medallion Architecture

02 Programming & Query

Python · SQL · PySpark · Pandas · T-SQL · Stored Procedures · Query Optimization · DAX · M (Power Query)

03 AI, ML & Vector Search

RAG Pattern · pgvector · XGBoost · MLflow · scikit-learn · sentence-transformers · HNSW Indexing · Semantic Search · Embeddings · NLP Pipelines · NLTK · BeautifulSoup · Scrapy · FastAPI · Streamlit

04 Databases & Cloud

Delta Lake · PostgreSQL · Snowflake · pgvector · SQL Server · Azure Synapse · OneLake · AWS

05 Architecture & Modeling

Star Schema · Dimensional Modeling · SCD Type 2 · Data Marts · Data Quality · Governed Reporting · KPI Design

06 BI, Tools & Ops

Power BI · Docker · GitHub Actions · Semantic Models · Direct Lake · Tableau · Databricks Dashboards · Git · Power Automate · JIRA

education

2023 · 2024

M.S. Computer & Information Sciences

Oklahoma State University
Stillwater, OK
2017 · 2021

B.Tech Computer Science

Jawaharlal Nehru Technological University
Hyderabad, India

contact

Got a role in mind? Let's talk.

Open to AI Data Engineer, Data Engineer, and senior Data Analyst / BI Developer roles. Remote or hybrid in Austin, TX.