Sneh’s Substack: Experience

Molina Healthcare

Sneh Vora — Wed, 17 Jun 2026 15:34:54 GMT

I didn’t expect my first real GenAI project to scare me.

Not the technical parts — those were hard, but manageable. What scared me was the realization, about two months in, that a wrong answer from our system could delay a patient’s authorization. That a hallucinated policy reference could mean someone waiting longer than they should for care.

That changed everything about how I built.

The Problem We Were Actually Solving

When I joined Molina Healthcare as an AI/ML Engineer in early 2025, the ask sounded clean: build a GenAI assistant to help staff search through healthcare policy documents.

But when I sat down with the people actually doing this work, the picture got sharper and messier.

A prior authorization specialist might need to cross-reference three different policy documents — coverage guidelines, authorization criteria, and clinical SOPs — just to answer one question. These weren’t short documents. We’re talking 8,000+ pages across hundreds of files, updated regularly, full of domain-specific language that doesn’t map cleanly to how anyone naturally asks a question.

They weren’t looking for a chatbot. They were looking for a trusted colleague who had read every policy ever written and could give them the right paragraph, right now.

That’s a very different thing to build.

What I Thought RAG Was vs. What It Actually Is

I’d worked with RAG before. The concept is straightforward: retrieve relevant chunks from a document store, feed them to an LLM, get a grounded answer. Simple enough on paper.

What I hadn’t fully reckoned with was how much the quality of your retrieval determines the quality of everything downstream.

We ingested all 8,000+ pages using PyMuPDF — extracting text, cleaning formatting artifacts, splitting documents into chunks that were small enough to be precise but large enough to hold context. That chunking strategy alone took two weeks of iteration. Too small and you lose the surrounding context that makes a clause meaningful. Too large and you’re feeding the model noise.
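To make the chunk-size tradeoff concrete, here is a minimal sketch of paragraph-aware chunking with overlap. It is an illustration, not the project's actual code: the function name `chunk_paragraphs`, the character budget, and the one-paragraph overlap are all assumptions, and the input is assumed to be text that has already been extracted and cleaned.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of each chunk are repeated at the start of
    the next one, so a clause keeps some of its surrounding context.
    A single paragraph longer than max_chars is kept whole (simplification).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Budget exceeded: close the current chunk and carry the overlap forward.
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "A" * 500 + "\n\n" + "B" * 500 + "\n\n" + "C" * 500
chunks = chunk_paragraphs(sample, max_chars=1100, overlap=1)
```

Lowering `max_chars` gives more precise chunks with less context; raising it does the opposite, which is the tradeoff that took the iteration.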

Then we indexed everything into Azure AI Search and layered FAISS on top for embedding-based similarity scoring. We added metadata filters — document type, policy category, effective date — so retrieval wasn’t just semantic, it was structured.
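The idea of filtering on metadata before ranking by similarity can be sketched in plain Python. This is not the Azure AI Search or FAISS API; the `Chunk` fields, the `search` function, and the toy two-dimensional embeddings are hypothetical stand-ins chosen to show the order of operations.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    doc_type: str            # e.g. "coverage_guideline", "clinical_sop"
    effective: date          # date this policy version took effect
    embedding: list[float]   # toy vector; real embeddings are much larger


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], chunks: list[Chunk], doc_type: str,
           as_of: date, top_k: int = 3) -> list[Chunk]:
    # Structured step: keep only the right document type, already in effect.
    eligible = [c for c in chunks if c.doc_type == doc_type and c.effective <= as_of]
    # Semantic step: rank what is left by embedding similarity.
    ranked = sorted(eligible, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:top_k]


corpus = [
    Chunk("old criteria", "authorization_criteria", date(2023, 1, 1), [1.0, 0.0]),
    Chunk("future criteria", "authorization_criteria", date(2030, 1, 1), [1.0, 0.0]),
    Chunk("sop", "clinical_sop", date(2023, 1, 1), [1.0, 0.0]),
    Chunk("other criteria", "authorization_criteria", date(2024, 1, 1), [0.0, 1.0]),
]
hits = search([1.0, 0.1], corpus, "authorization_criteria", as_of=date(2025, 6, 1))
```

Without the filter, the not-yet-effective policy and the SOP would tie for the top spot on similarity alone.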

That combination improved our retrieval accuracy by 22% in internal testing. What the number doesn’t capture is why: we stopped treating retrieval as a search problem and started treating it as a comprehension problem. The system needed to understand what the staff member was really asking, not just match keywords.

The Lesson I Didn’t Expect: Guardrails Are the Product

Early on, I thought of PHI masking and RBAC controls as compliance checkboxes. Things you add at the end so legal signs off.

I was wrong.

In a healthcare context, the guardrails are what make the product usable. A system that gives a fast, accurate answer but leaks protected health information in the process isn’t a product — it’s a liability. A system that can be queried by anyone regardless of role isn’t a tool — it’s a risk.

When we implemented PHI masking, few-shot prompting for consistency, and role-based access controls, we weren’t adding friction. We were building the foundation of trust that makes anyone willing to rely on the system in the first place.
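As a rough illustration of what those two guardrails look like in code, here is a deliberately small sketch. It assumes regex-based masking and a static role table; the patterns, role names, and document types are hypothetical, and real PHI detection needs far more than three regular expressions.

```python
import re

# Hypothetical patterns for illustration only; not a complete PHI detector.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

# Hypothetical mapping from role to the document types that role may query.
ROLE_DOC_TYPES = {
    "prior_auth_specialist": {"coverage_guideline", "authorization_criteria", "clinical_sop"},
    "call_center": {"coverage_guideline"},
}


def mask_phi(text: str) -> str:
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def can_read(role: str, doc_type: str) -> bool:
    """Deny by default: unknown roles get an empty permission set."""
    return doc_type in ROLE_DOC_TYPES.get(role, set())


masked = mask_phi("Member SSN 123-45-6789, call 555-123-4567 or jane.doe@example.com.")
```

The deny-by-default lookup in `can_read` is the design choice that matters: a role nobody configured sees nothing rather than everything.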

The 30% reduction in manual lookup time is the metric I put on my resume. But what I’m actually proud of is that the team started using it without being told to — because they trusted it.

What I’d Tell Anyone Building AI in a Regulated Industry

Accuracy is table stakes, not the goal. Every RAG system can surface relevant text. The question is whether it surfaces the right text with enough context for a human to act on it responsibly.

Build with the person doing the job, not just for them. The specialists who used our system daily spotted retrieval failures I never would have caught in testing. Their feedback shaped the metadata filters, the prompt structure, and the confidence thresholds we used.

A wrong answer at speed is worse than a slow right one. We deliberately added latency to certain query paths so that high-stakes authorization decisions went through human review. That was the right call.
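A minimal sketch of that kind of routing, assuming a retrieval confidence score between 0 and 1 and a flag marking high-stakes queries. The function name `route`, the labels, and the 0.75 cutoff are hypothetical, not the values we used.

```python
def route(retrieval_score: float, high_stakes: bool, min_score: float = 0.75) -> str:
    """Decide what happens to a query before any answer is shown."""
    if high_stakes:
        # Slower on purpose: a person checks the answer first.
        return "human_review"
    if retrieval_score < min_score:
        # Weak retrieval: say "not sure" instead of guessing.
        return "abstain"
    return "answer"


decisions = [route(0.9, True), route(0.5, False), route(0.9, False)]
```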

Where This Is Going

Healthcare AI is at an interesting inflection point. The technology is good enough to be genuinely useful. The harder problem is institutional trust — getting clinicians, administrators, and compliance teams to believe that an AI-assisted workflow is safer and more consistent than a manual one.

That trust isn’t built through demos. It’s built through guardrails, evaluation, transparency about what the system doesn’t know, and a long track record of getting it right.

I’m still building that track record. But I’m more convinced than ever that the engineers who will matter most in this space aren’t the ones who can build the fastest model — they’re the ones who understand what it means when that model is wrong.

Accenture

Sneh Vora — Wed, 17 Jun 2026 15:31:21 GMT

The first time I looked at our confusion matrix, I thought we had a great model.

96% accuracy. Clean numbers. My manager nodded. I felt good.

Then a senior data scientist on the team asked me one question: “What’s the base rate of fraud in the dataset?”

About 0.4%.

She didn’t say anything else. She didn’t need to. A model that predicted “not fraud” for every single transaction would have been 99.6% accurate, better than our 96%, and completely useless.
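The arithmetic is worth spelling out. This sketch assumes a round 1,000,000 transactions at the 0.4% base rate (an illustrative figure, not the exact dataset size) and scores a model that never predicts fraud.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# 0.4% of 1,000,000 is 4,000 fraudulent and 996,000 legitimate transactions.
# A model that always says "not fraud" misses every one of the 4,000.
always_negative = metrics(tp=0, fp=0, fn=4_000, tn=996_000)
```

Accuracy comes out at 0.996 while recall is 0.0, which is why recall and precision on the fraud class are the numbers to look at first.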

That was week three at Accenture. I had a lot more to learn.

What the Project Actually Was

I joined Accenture as a Machine Learning Engineer in mid-2021, working on a fraud detection and risk scoring platform for a large BFSI client. Banking, financial services, insurance — an industry where the cost of getting something wrong isn’t abstract. It’s real money, real customers, real consequences.

Our task: analyze over a million transaction records and build a system that could identify fraudulent activity, flag anomalies, and generate risk scores for review.

On paper, it sounded like a classic ML project. Train a classifier, evaluate on a test set, deploy.

In reality, it was one of the most humbling experiences of my career.

The Data Was the Actual Job

Before I wrote a single line of model code, I spent weeks just understanding the data.

Fraud data has a problem that most textbook ML datasets don’t: it’s almost entirely negative. Fraud is rare by design: bad actors try hard not to look like bad actors. In our dataset, fraudulent transactions were about 0.4% of the total. If your model learns to just say “not fraud” every time, it will score well on accuracy and fail completely at the one thing it’s supposed to do.

So we got to work on the feature engineering side — and this is where most of the real value was created.

We built over 40 features: transaction frequency patterns, amount deviation from a customer’s historical baseline, merchant category mismatches, device fingerprints, geographic anomalies, account age, payment channel behavior, and historical fraud indicators. Each of these was a hypothesis about what a fraudulent transaction looks like, encoded into something a model could learn from.
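Two of those hypotheses, sketched in plain Python to show what "encoded into something a model could learn from" means. The function names, the one-hour window, and the use of a z-score for amount deviation are assumptions for illustration, not the project's actual feature definitions.

```python
from statistics import mean, pstdev


def amount_zscore(history: list[float], amount: float) -> float:
    """How many standard deviations this amount sits from the customer's baseline."""
    if len(history) < 2:
        return 0.0  # not enough history to define a baseline
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # simplification: constant history gives no scale to compare against
    return (amount - mean(history)) / sigma


def txns_in_window(timestamps: list[float], now: float, window_seconds: float = 3600) -> int:
    """Count this customer's transactions in the trailing window (seconds)."""
    return sum(1 for t in timestamps if 0 <= now - t <= window_seconds)


history = [20.0, 30.0, 40.0, 50.0, 60.0]
typical = amount_zscore(history, 40.0)
unusual = amount_zscore(history, 400.0)
burst = txns_in_window([0, 1000, 3500, 4000], now=4000)
```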

Then came the imbalance problem. We used SMOTE to generate synthetic minority-class samples, combined with undersampling and class-weight tuning. That combination improved our fraud recall by nearly 20% — meaning we caught significantly more actual fraud — while keeping false positives at a level the client’s review team could actually handle.
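For readers who haven't seen these techniques, here is a simplified sketch of the two ideas. `smote_like` is a bare-bones version of SMOTE-style interpolation written for illustration, not the imbalanced-learn implementation, and `balanced_class_weights` uses the common "balanced" heuristic of n_samples / (n_classes * class_count). Both function names are hypothetical. One caveat that applies to any resampling: do it on the training split only, never on the data you evaluate on.

```python
import math
import random
from collections import Counter


def smote_like(minority: list[list[float]], n_new: int, k: int = 3, seed: int = 0) -> list[list[float]]:
    """Create synthetic points between a minority sample and one of its
    k nearest minority neighbours. Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


fraud_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(fraud_points, n_new=10)
weights = balanced_class_weights([0] * 996 + [1] * 4)
```

At a 0.4% fraud rate the balanced weight on the fraud class works out to 125, which shows how hard the loss has to lean to make the minority class count.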

That balance matters more than people realize. Flag too much and your human reviewers drown. Flag too little and the fraud slips through. The model isn’t making that decision alone — it’s making it in partnership with the people downstream.

The Model Wasn’t the Hard Part

We landed on XGBoost for the core classifier. It handled the tabular data well, gave us interpretable feature importances, and responded well to hyperparameter tuning. After cross-validation and threshold optimization, we were sitting at roughly 0.86 ROC-AUC — an 18–22% improvement over the baseline the client had been using.

We also layered in Isolation Forest for anomaly detection. The XGBoost model was trained on historical fraud patterns — it was good at catching things that looked like fraud it had seen before. Isolation Forest was good at catching things that just looked weird, regardless of whether they matched a known pattern. Together they covered more surface area.

But here’s what I actually learned from all of this: the model performance metrics were almost never the most important conversation.

The most important conversations were about thresholds.

At what score do you flag a transaction for human review? At what score do you block it automatically? What’s the cost of a false positive to a legitimate customer who gets their card declined? What’s the cost of a false negative to the client when a fraudulent transaction goes through?

These aren’t ML questions. They’re business questions. And the answers changed depending on who you asked — the risk team, the product team, the compliance team, the executive sponsor. Part of my job was translating between what the model could do and what those conversations actually required.
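One way to make those conversations concrete is to put the costs into the threshold choice. The sketch below is an assumption-laden illustration: the cutoffs in `decide`, the cost figures, and the tiny scored sample are all made up, and the function names are hypothetical.

```python
def decide(score: float, review_at: float = 0.30, block_at: float = 0.90) -> str:
    """Map a risk score to an action using two cutoffs."""
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "allow"


def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of flagging at `threshold`: false alarms plus missed fraud."""
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp
        elif not flagged and is_fraud:
            cost += cost_fn
    return cost


def best_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.95]
labels = [0, 0, 0, 1, 0, 1]
candidates = [0.3, 0.5, 0.7, 0.9]
when_misses_hurt = best_threshold(scores, labels, candidates, cost_fp=1, cost_fn=50)
when_alarms_hurt = best_threshold(scores, labels, candidates, cost_fp=100, cost_fn=50)
```

Same model, same scores: the chosen threshold moves from 0.5 to 0.9 purely because the cost of a false positive changed. That is the part the risk, product, and compliance teams have to agree on.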

The Lesson That Has Stayed With Me

I came into Accenture thinking the job was to build a good model.

I left understanding that the job was to build a good decision system — one where the model was one component, the features were another, the threshold logic was another, the human review process was another, and the monitoring pipeline that caught when the distribution shifted was another.

A model that performs well in evaluation and degrades quietly in production without anyone noticing isn’t a success. It’s a slow failure.

We implemented monitoring and batch scoring automation partly to speed up the review workflow, but also to give the team a way to see when something was changing — when fraud patterns were evolving, when a new attack vector was emerging, when the model’s assumptions were starting to drift from reality.
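A common way to quantify that kind of shift is the Population Stability Index, which compares the binned distribution of a feature or score today against a baseline. This is a generic sketch, not the monitoring we actually ran; the function name `psi`, the ten equal-width bins, and the `eps` floor for empty bins are assumptions.

```python
import math


def psi(expected: list[float], actual: list[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Values outside the baseline range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [min(v + 0.3, 0.99) for v in baseline]
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating, though the right alert level depends on the feature and the bin count.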

That’s the part of ML that doesn’t make it into blog posts about model architectures. It’s less exciting than a good ROC curve. But it’s the difference between a project and a product.

What I’d Tell a Junior ML Engineer Starting Out

Fall in love with the features, not the model. Anyone can throw XGBoost at a dataset. The people who create real value understand the domain well enough to know which signals matter and why.

Learn to think in thresholds, not accuracy. Almost every real ML decision involves a tradeoff between precision and recall, between false positives and false negatives. Understand the asymmetric costs in your domain before you touch a single hyperparameter.

The model is not the system. Data pipelines, feature stores, monitoring, human-in-the-loop workflows — these are not afterthoughts. They are the product.

Ask the dumb question. The confusion matrix lesson in week three came because someone asked a question I should have asked myself. In every project since, I’ve tried to be the person who asks it first.

Fraud detection taught me that machine learning in the real world is less about elegance and more about understanding what happens when you’re wrong — and building something resilient enough to handle it.

I’m still building on that foundation.