Open to new opportunities · Austin, TX

Dharmic Reddy Meka

AI Data Engineer · Sr Data Analyst @ UT Austin

I build data pipelines, lakehouses, and AI retrieval systems that power analytics at scale. Currently shipping BI infrastructure on Microsoft Fabric for a $2B+ capital projects portfolio at UT Austin. Previously at Accenture (UBS) and Oklahoma State University.


about

I'm a data engineer focused on the infrastructure that makes analytics and AI possible. At UT Austin, I architect governed lakehouses on Microsoft Fabric for a $2B+ capital projects portfolio and ship the Azure Data Factory pipelines that feed them.

Before this, I built Databricks + PySpark ETL at Accenture (UBS) that serves 200+ business users, and engineered an end-to-end NLP pipeline at Oklahoma State that processed 50K+ news articles for election research.

Lately I've been going deep on vector search and retrieval systems. My latest side project is a production-shaped RAG pipeline over arXiv papers using Postgres + pgvector. Targeting AI Data Engineer and Data Engineer roles next.

Location: Austin, Texas
Current: Sr Data Analyst · UT Austin
Focus: Data Engineering · AI retrieval
Experience: 3+ years production systems
Status: Open to opportunities
99.5% · Pipeline reliability across production Databricks + ADF workflows
87% · ETL runtime cut from 6+ hrs to 45 min via ADF + PySpark orchestration
$2B+ · Capital portfolio served by a governed Fabric lakehouse
50K+ · Articles processed through an end-to-end Python + NLP pipeline

work

Mar 2025 · Now
Current

Sr Data Analyst

The University of Texas at Austin
Planning, Design & Construction · Austin, TX
Microsoft Fabric · Azure Data Factory · Dataflows Gen2 · Direct Lake · Star Schema · DAX · Power Automate
  • Architected a governed lakehouse semantic layer on Microsoft Fabric with Direct Lake and star-schema restructuring for a $2B+ capital projects portfolio. Cut query refresh time by 60%+ and eliminated cross-workspace inconsistencies across 5 PDC teams.
  • Built automated ingestion pipelines on Azure Data Factory and Fabric Dataflows Gen2 consolidating financial data from 4+ source systems into a single governed reporting layer (consolidation pattern sketched below). Saved 3 hours per reporting cycle of manual data prep.
  • Designed dimensional models and optimized DAX using SUMMARIZECOLUMNS patterns and calculation groups. Cut ad-hoc data requests from 15 to 7 per month (53% reduction) and accelerated budget reviews by 3 days.
  • Migrated 15+ legacy Tableau workloads onto the governed Power BI semantic layer, standardizing KPI logic and fiscal hierarchies across teams.
  • Orchestrated Power Automate workflows for dataset refresh alerts, multi-stage approval routing, and exception flagging. Eliminated 5+ hours per week of manual follow-up.
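
The production pipelines in this role run in Azure Data Factory and Dataflows Gen2, so the snippet below is only an illustration of the multi-source consolidation pattern, written as the kind of PySpark that runs in a Fabric notebook. The source systems, table names, and columns are hypothetical.

```python
# Illustration only: the real pipelines are ADF + Dataflows Gen2.
# Assumes a Fabric/Spark notebook; table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

SOURCES = {  # hypothetical source systems feeding the reporting layer
    "erp": "raw_finance.erp_commitments",
    "pm_tool": "raw_finance.project_budgets",
}

conformed = None
for system, table in SOURCES.items():
    df = (
        spark.table(table)
        .select("project_id", "fiscal_year", "amount")
        .withColumn("source_system", F.lit(system))
    )
    conformed = df if conformed is None else conformed.unionByName(df)

# Land one governed table that every downstream semantic model reads from.
conformed.write.mode("overwrite").saveAsTable("gold.fact_capital_spend")
```
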
Aug 2024 · Mar 2025

Data Analyst · NLP Research

Oklahoma State University
Stillwater, OK
Python · BeautifulSoup · Scrapy · NLTK · NLP Pipelines · PostgreSQL · Data Quality
  • Built an end-to-end Python data pipeline (BeautifulSoup, Scrapy) that ingested and structured 50,000+ U.S. news articles into PostgreSQL with automated data-quality checks. This formed the foundation layer for all downstream NLP work.
  • Designed an NLP sentiment scoring pipeline using tokenization, normalization, and lemmatization to extract candidate-level media polarity across the 2024 U.S. election cycle (sketched below). Surfaced measurable bias patterns across 10+ news sources.
  • Delivered the analytics layer on top of this pipeline so the research team could identify statistically significant coverage disparities used in published findings.
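
A minimal sketch of the candidate-level scoring step described above, assuming NLTK with the punkt, wordnet, and vader_lexicon resources available. VADER stands in for the project's scoring approach, the candidate name and sample text are placeholders, and the real pipeline reads articles from PostgreSQL.

```python
# Sketch only: VADER is a stand-in scorer; real input comes from PostgreSQL.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer

for pkg in ("punkt", "punkt_tab", "wordnet", "vader_lexicon"):
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()

def candidate_polarity(text: str, candidate: str) -> float | None:
    """Average compound sentiment of the sentences that mention the candidate."""
    scores = []
    for sent in nltk.sent_tokenize(text):
        # normalization + lemmatization before matching the candidate token
        tokens = [lemmatizer.lemmatize(t.lower()) for t in word_tokenize(sent)]
        if candidate.lower() in tokens:
            scores.append(sia.polarity_scores(sent)["compound"])
    return sum(scores) / len(scores) if scores else None

print(candidate_polarity("Polls show Smith gaining ground after the debate.", "Smith"))
```
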
Aug 2021 · Dec 2022

Analytics Engineer

Accenture · Financial Services
Client: UBS · Hyderabad, India
Databricks · PySpark · Azure Data Factory · REST APIs · Star Schema · SQL · Data Marts
  • Built and maintained Databricks ETL pipelines in PySpark that ingested financial data from REST APIs, relational databases, and flat files at scale, with automated validation checks and structured error handling (pattern sketched below).
  • Orchestrated production workflows via Azure Data Factory. Reduced average pipeline processing time from 6+ hours to under 45 minutes with 99.5% reliability across reporting cycles.
  • Designed star-schema dimensional models and governed data marts supporting P&L tracking, KPI reporting, and executive analytics. Created a single source of truth consumed by 200+ business users across multiple business units.
  • Translated complex financial requirements into a governed analytics layer adopted by senior leadership for monthly performance reviews across UBS business units.
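
A hedged PySpark sketch of the validate-then-load pattern from this role, not UBS code: the landing path, schema, thresholds, and target table are illustrative.

```python
# Illustrative validate-then-load step for a Databricks ETL job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", True).csv("/mnt/landing/positions/")

# Automated checks: required keys present, no duplicate business keys.
checks = {
    "null_account_id": raw.filter(F.col("account_id").isNull()).count(),
    "duplicate_keys": raw.count()
        - raw.dropDuplicates(["account_id", "as_of_date"]).count(),
}
failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # Fail loudly so the ADF orchestrator can alert and retry.
    raise ValueError(f"Data quality checks failed: {failed}")

# Conform types and append into a star-schema fact table.
fact = (
    raw.withColumn("as_of_date", F.to_date("as_of_date"))
       .withColumn("market_value", F.col("market_value").cast("double"))
)
fact.write.format("delta").mode("append").saveAsTable("gold.fact_positions")
```
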

projects

Data Engineering · Sports ML

IPL Winner Prediction

End-to-end data engineering pipeline for IPL match prediction, built on licensed open data and official APIs only (no Terms-of-Service violations). Ingests historical match data from Cricsheet (ODbL), venue metadata from Wikipedia (CC BY-SA), and upcoming fixtures from CricketData.org; flows through a dbt warehouse with a star schema; feeds an XGBoost classifier with probability calibration; and serves predictions through a deployed Streamlit dashboard.

What I built
  • dbt warehouse with bronze, silver, and gold layers. Star schema across fact_matches, fact_ball_by_ball, dim_teams, dim_venues, dim_players, plus a team_canonical seed for entity resolution.
  • SCD Type 2 snapshot tracking team rebrandings (e.g. RCB Bangalore to Bengaluru) so historical matches stay tied to the right entity.
  • Strict walk-forward modeling split (train: 2022, val: early 2023, holdout: late 2023 + 2024). XGBoost beats the baseline by 9.8pp accuracy on the 102-match holdout (split and calibration sketched below).
  • Probability calibration with reliability diagram, Brier score, and ECE reported honestly in a published model card. MLflow tracks every run.
  • Dual-runtime orchestration: same Python entrypoints run inside both GitHub Actions (weekly cron, ephemeral Postgres service container) and a local Airflow DAG.
  • Live Streamlit dashboard with three pages (Predict, Calibration, Data). Reads from a SQLite snapshot bundled in the repo so the deploy is free-tier.
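
A minimal sketch of the walk-forward split and calibration step referenced above, assuming a feature table with a match_date column. The feature names are placeholders and the cut-off dates only approximate the season boundaries used in the repo.

```python
# Sketch: temporal split, XGBoost fit, probability calibration, holdout metrics.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, brier_score_loss

df = pd.read_parquet("gold_match_features.parquet").sort_values("match_date")

train = df[df["match_date"] < "2023-01-01"]
val = df[(df["match_date"] >= "2023-01-01") & (df["match_date"] < "2023-07-01")]
hold = df[df["match_date"] >= "2023-07-01"]   # untouched until final evaluation

features = ["team_elo_diff", "venue_win_rate", "won_toss"]   # placeholder features
model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(train[features], train["team1_won"])

# Calibrate predicted probabilities on the validation window only.
calibrated = CalibratedClassifierCV(model, method="isotonic", cv="prefit")
calibrated.fit(val[features], val["team1_won"])

proba = calibrated.predict_proba(hold[features])[:, 1]
print("holdout accuracy:", accuracy_score(hold["team1_won"], proba >= 0.5))
print("Brier score:", brier_score_loss(hold["team1_won"], proba))
```
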
Tech stack
Language: Python
Warehouse: PostgreSQL · dbt Core
ML: scikit-learn · XGBoost · calibration · MLflow
Orchestration: GitHub Actions (prod) · Airflow (local demo)
Dashboard: Streamlit Community Cloud · SQLite snapshot
Ingestion: Python · httpx · bulk download + REST APIs
Data Sources: Cricsheet (ODbL) · Wikipedia (CC BY-SA) · CricketData.org
Patterns: Star Schema · SCD Type 2 · Walk-forward CV
Infra: Docker · docker-compose · ephemeral PG service
XGBoost +9.8pp · 218 matches · Live Streamlit demo
View on GitHub
AI Data Engineering · Vector Search

arXiv RAG Pipeline with pgvector

Repo

A production-shaped RAG retrieval layer over recent arXiv ML papers. Not a notebook demo. Pulls papers via the arXiv API, chunks titles and abstracts with a sliding window, embeds each chunk on CPU with sentence-transformers, stores everything in Postgres + pgvector with an HNSW index, and serves semantic search behind a typed FastAPI endpoint with category filtering.
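
A sketch of the chunk-and-embed step this paragraph describes, assuming sentence-transformers is installed. The window sizes and the sample paper are illustrative rather than the repo's exact settings.

```python
# Sketch: sliding-window chunking over title + abstract, CPU embedding.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim, CPU-friendly

def sliding_chunks(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Word-level sliding window over the concatenated title and abstract."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

paper = {"title": "Attention Is All You Need", "abstract": "The dominant sequence ..."}
chunks = sliding_chunks(f'{paper["title"]}. {paper["abstract"]}')

# normalize_embeddings=True yields unit-length vectors, so cosine distance in
# pgvector agrees with the HNSW vector_cosine_ops index described below.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)   # (n_chunks, 384)
```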

What I built
  • One Postgres instance holds both metadata and vectors: the arXiv category filter and cosine ranking happen in a single SQL query, with no separate vector DB service (query sketched below).
  • HNSW cosine index aligned with vector_cosine_ops on L2-normalized embeddings so the operator and index agree for correct ANN ranking.
  • psycopg3 connection pool with the pgvector type adapter registered at connect time, so numpy.ndarray embeddings map to the Postgres vector type without extra conversion on ingest and search.
  • Resilient arXiv ingestion: httpx + feedparser + tenacity retries, 3-second rate limiting, paginated fetch with idempotent delete-then-insert per paper.
  • GitHub Actions CI spins up a real pgvector/pgvector:pg16 service container to verify migrations and ANN search end to end.
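
A hedged sketch of the single-query search path called out above. The chunks/papers schema, column names, array-typed categories column, and connection string are illustrative, not the repo's exact DDL.

```python
# Sketch: category filter + cosine ranking in one SQL statement via psycopg3.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("postgresql://localhost/arxiv") as conn:
    register_vector(conn)   # numpy arrays <-> the Postgres vector type

    query_vec = np.random.rand(384).astype(np.float32)   # stand-in for an embedded query

    rows = conn.execute(
        """
        SELECT p.arxiv_id, p.title, 1 - (c.embedding <=> %s) AS cosine_similarity
        FROM chunks c
        JOIN papers p ON p.id = c.paper_id
        WHERE %s = ANY(p.categories)      -- category filter in the same statement
        ORDER BY c.embedding <=> %s       -- pgvector cosine-distance operator
        LIMIT 10
        """,
        (query_vec, "cs.LG", query_vec),
    ).fetchall()

    for arxiv_id, title, score in rows:
        print(f"{score:.3f}  {arxiv_id}  {title}")
```
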
Tech stack
Language: Python 3.11
API: FastAPI · Pydantic v2
Vector DB: Postgres 16 · pgvector · HNSW (cosine)
Embeddings: sentence-transformers · all-MiniLM-L6-v2 (384-dim)
DB Driver: psycopg3 · connection pool · pgvector adapter
Ingestion: httpx · feedparser · tenacity retries
Config: pydantic-settings · structlog
Infra: Docker · docker-compose · GitHub Actions CI
Testing: pytest · mocked API · pgvector integration
pgvector HNSW · 384-dim embeddings · FastAPI /search
View on GitHub
Data Engineering · Healthcare

Multi-Source Healthcare Claims Lakehouse

Repo

End-to-end Databricks lakehouse that ingests healthcare claims from three source formats: structured CSV billing, nested JSON provider records, and unstructured PDF clinical notes. Flows through a Medallion (Bronze, Silver, Gold) architecture on Delta Lake into a star schema, with a 12-chart analytics dashboard on top.

What I built
  • Regex-based NLP extraction from clinical notes: vitals, diagnoses, medications, follow-up windows.
  • SCD Type 2 on the provider dimension for point-in-time queries on specialty and network status changes.
  • Custom data quality framework simulating Delta Live Tables expectations (warn / drop / fail), logged to a DQ table for monitoring (pattern sketched below).
  • Star schema with fact_claims and 5 dimensions, referential integrity validated, powering denial-rate and network-comparison analytics.
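
A minimal sketch of the warn / drop / fail expectation pattern referenced above, assuming a Databricks notebook where spark is predefined. Rule names, columns, and the DQ log table are illustrative.

```python
# Sketch: DLT-style expectations applied manually, with outcomes logged to a DQ table.
from pyspark.sql import DataFrame, functions as F

EXPECTATIONS = [
    # (rule name, condition, action) where action is "warn", "drop", or "fail"
    ("claim_amount_positive", F.col("claim_amount") > 0, "drop"),
    ("member_id_present", F.col("member_id").isNotNull(), "fail"),
    ("icd_code_format", F.col("icd10_code").rlike(r"^[A-Z]\d{2}"), "warn"),
]

def apply_expectations(df: DataFrame, table: str) -> DataFrame:
    for name, cond, action in EXPECTATIONS:
        failed = df.filter(~cond).count()
        # Log every rule outcome to a monitoring table, pass or fail.
        spark.createDataFrame(
            [(table, name, action, failed)],
            "table_name string, rule string, action string, failed_rows long",
        ).write.mode("append").saveAsTable("monitoring.dq_results")

        if failed and action == "fail":
            raise ValueError(f"{table}: expectation '{name}' failed for {failed} rows")
        if action == "drop":
            df = df.filter(cond)
    return df

silver_claims = apply_expectations(spark.table("bronze.claims"), "bronze.claims")
```
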
Tech stack
Language: PySpark · Spark SQL · Python
Platform: Databricks · Unity Catalog
Storage: Delta Lake · Managed Delta Tables
Patterns: Medallion · Star Schema · SCD Type 2
DQ: Custom DQ framework · DLT-style expectations
Viz: Databricks Dashboards · 12 chart types
3 source formats · 15 Delta tables · 12 visualizations
View on GitHub
Data Engineering · Retail

Retail Sales Analytics Pipeline

Repo

Production-style ETL pipeline on Databricks processing 541,909 real UK e-commerce transactions from the Kaggle Online Retail II dataset. Runs from raw ingestion through cleaned Silver to four aggregated Gold tables, feeding a scheduled AI/BI dashboard that refreshes daily at 06:00 UTC.

What I built
  • Bronze to Silver to Gold layering with Delta Lake: cancellation removal, dedup, type casting, derived revenue columns.
  • RFM customer segmentation surfacing 4,346 customers tiered into High / Mid / Low value (sketched below).
  • Scheduled Workflow orchestrating bronze to silver to gold as sequential dependent tasks, daily at 06:00 UTC.
  • 4 Gold tables powering the dashboard: monthly revenue by country, top 3,896 products, RFM tiers, day-of-week sales.
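
A sketch of the RFM tiering step referenced above, assuming a Databricks notebook where spark is predefined; the Silver table and column names are illustrative.

```python
# Sketch: recency / frequency / monetary aggregation and simple value tiers.
from pyspark.sql import functions as F, Window

silver = spark.table("silver.transactions")
snapshot = silver.agg(F.max("invoice_date")).first()[0]

rfm = silver.groupBy("customer_id").agg(
    F.datediff(F.lit(snapshot), F.max("invoice_date")).alias("recency_days"),
    F.countDistinct("invoice_no").alias("frequency"),
    F.sum("revenue").alias("monetary"),
)

# Rank customers into thirds on monetary value and map to High / Mid / Low tiers.
tier = F.ntile(3).over(Window.orderBy(F.desc("monetary")))
rfm_tiered = rfm.withColumn(
    "value_tier",
    F.when(tier == 1, "High").when(tier == 2, "Mid").otherwise("Low"),
)
rfm_tiered.write.format("delta").mode("overwrite").saveAsTable("gold.customer_rfm")
```
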
Tech stack
Language: PySpark · Python
Platform: Databricks · Apache Spark
Storage: Delta Lake
Orchestration: Databricks Workflows · Scheduled Jobs
Viz: Databricks AI/BI Dashboards
Source: Kaggle API · Online Retail II (2010-2011)
Pattern: Medallion · RFM Segmentation
541K transactions · 4 Gold tables · Daily scheduled job
View on GitHub

skills

01 Data Engineering & Pipelines

Microsoft Fabric · Databricks · Azure Data Factory · PySpark · dbt · Apache Airflow · Apache Spark · Dataflows Gen2 · Databricks Workflows · Medallion Architecture

02 Programming & Query

Python · SQL · PySpark · Pandas · T-SQL · Stored Procedures · Query Optimization · DAX · M (Power Query)

03 AI, ML & Vector Search

RAG Pattern · pgvector · XGBoost · MLflow · scikit-learn · sentence-transformers · HNSW Indexing · Semantic Search · Embeddings · NLP Pipelines · NLTK · BeautifulSoup · Scrapy · FastAPI · Streamlit

04 Databases & Cloud

Delta Lake · PostgreSQL · Snowflake · pgvector · SQL Server · Azure Synapse · OneLake · AWS

05 Architecture & Modeling

Star Schema · Dimensional Modeling · SCD Type 2 · Data Marts · Data Quality · Governed Reporting · KPI Design

06 BI, Tools & Ops

Power BI · Docker · GitHub Actions · Semantic Models · Direct Lake · Tableau · Databricks Dashboards · Git · Power Automate · JIRA

education

2023 · 2024

M.S. Computer & Information Sciences

Oklahoma State University
Stillwater, OK
2017 · 2021

B.Tech Computer Science

Jawaharlal Nehru Technological University
Hyderabad, India

contact

Got a role in mind? Let's talk.

Open to AI Data Engineer, Data Engineer, and senior Data Analyst / BI Developer roles. Remote or hybrid in Austin, TX.