Torq Reflex: Teaching the AI SOC Judgment

Noam Cohen is a serial entrepreneur building seriously cool data and AI companies since 2018. Noam’s insights are informed by a unique combination of data, product, and AI expertise — with a background that includes winning the Israel Defense Prize for his work in leveraging data to predict terror attacks. As the Head of Artificial Intelligence at Torq, Noam is helping build truly next-gen AI capabilities into Torq’s autonomous SOC platform.

Your analyst tells the AI that the alert is just the red team running a scheduled pen test. They told it the same thing yesterday, and the day before. It works, then it doesn’t, then it does — and that flakiness is harder to live with than a tool that’s simply, predictably wrong.

This is happening in almost every SOC that runs AI triage, and the AI isn’t broken. It just has no memory of its own decisions. It looks at each alert individually, with no idea that your team has already settled this question a dozen times over. Every shift, it starts from scratch.

We tend to file the result under alert fatigue, but that undersells it. The higher cost is what happens to trust. Once analysts stop believing what the AI tells them, they treat it as one more box to click, and the money you spent on AI triage quietly buys you nothing. Earning that trust back is the problem the whole category has to solve next, and it’s the problem the learning layer of the Torq AI SOC Platform was built for.

Where First-Generation AI Triage Stops

The first wave of “AI SOC” tools proved that a large language model can read an alert and produce a credible analysis on day one, with no prior training on your environment. That breakthrough was great (at the time), and the day-one analysis it unlocked is part of the foundation Torq builds on. But it isn’t enough. A capable AI SOC has to keep learning, and the first generation treats day one as the ceiling.

That’s because the trouble starts after day one. The model judges every alert from scratch, and when an analyst corrects it, that correction is added to the prompt rather than the model itself. So the false positive your team carefully explained on Monday comes back looking brand new on Tuesday. Researchers call this self-inconsistency, an inherent failure mode of LLMs. Your analysts call it the reason they stopped trusting the tool.

We reviewed more than 1,000 real analyst corrections across four customer environments to understand why the AI and the analyst kept disagreeing. The reasons split cleanly into two groups.

About half the time, the AI was simply missing context. It flagged an admin who was authorized to do exactly what they did, or a piece of software the team already knew was safe, because nobody had told it who does what inside your walls. That kind of gap has a clear fix, described in the first two posts in our series: the Torq Context Graph and Torq Recall feed that organizational knowledge straight into the model. Together with Reflex, they form the memory and learning layer beneath every agent in the Torq AI SOC Platform.

The other half of disagreements are harder to reconcile, and they’re exactly what Torq Reflex was built for. In these cases, the AI wasn’t missing information. It had access to everything the analyst did, but still landed somewhere the team wouldn’t. Sometimes it contradicted its own earlier verdicts. Other times, it just read a borderline call differently than the team did, treating a contained threat as resolved when the team’s policy was to escalate it anyway, or accepting a thin set of signals as proof when the analysts wanted more. No amount of extra data closes a gap like that, because the gap is about judgment.

Why “Learning AI” From Other Vendors Often Isn’t

Everyone in this space says their AI “learns” now, and they rarely mean the same thing by it. Usually, it comes down to one of two moves: the tool rewrites its own prompt based on your feedback, or it saves your past corrections and pulls up similar ones when a new alert looks familiar. Both are useful. Both run into the same wall.

You can’t prompt your way to consistency. There’s no instruction that makes a model with no memory return the same verdict it gave yesterday; that takes a model that actually remembers what it decided.

Your team’s risk tolerance has the same problem. Something like “we treat blocked executions as malicious” isn’t really a rule. It’s a pattern that only becomes visible once you’ve watched the team rule the same way across a pile of messy edge cases. Calibration works like that, too. Analysts learn how much evidence is enough by being wrong and adjusting, not by being told to relax.

Prompts and lookups can describe your team’s judgment well enough. But they can’t absorb it; they’re limited and inconsistent, and they offer no real confidence in the answer. That’s the line Reflex crosses.
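To make the lookup approach concrete, here is a minimal sketch of retrieving the nearest past correction by cosine similarity. The vectors stand in for alert embeddings. The names cosine, nearest_correction, and memory, along with the toy data, are illustrative assumptions and not any vendor’s API.

```python
import math
from typing import List, Sequence, Tuple

Correction = Tuple[Sequence[float], str]  # (alert embedding, analyst verdict)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_correction(query: Sequence[float], memory: List[Correction]) -> Tuple[str, float]:
    """Return the verdict of the most similar stored correction and its similarity."""
    best_vec, best_verdict = max(memory, key=lambda item: cosine(query, item[0]))
    return best_verdict, cosine(query, best_vec)


# Toy, hand-made embeddings (hypothetical): one past benign call, one malicious.
memory: List[Correction] = [
    ([1.0, 0.0, 0.0], "benign"),
    ([0.0, 1.0, 0.0], "malicious"),
]
verdict, score = nearest_correction([0.9, 0.1, 0.0], memory)
print(verdict, round(score, 3))
```

The returned score measures how closely the new alert resembles a past case. It is not a probability that the retrieved verdict is right for the new alert.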

There’s a quieter problem, too: model swapping. As providers upgrade and retire the underlying LLMs, behavior shifts beneath you. Our research shows you have to tune both the prompt and the model, but when the model changes, customer instructions rarely get retuned along with it, which creates more inconsistency and less confidence at the worst possible time. Because Reflex is a stateful model trained on your team’s own calls, it stays resilient to those model changes.

How Torq Reflex Works

The easiest way to picture it is Spotify. Out of the box, Spotify plays music that appeals broadly to everyone: the hits and the editorial playlists. That’s the equivalent of the LLM baseline, and every AI SOC ships with a version of it. Discover Weekly is the part that makes Spotify feel personal, built from what you play and what you skip.

Reflex plays the role of Discover Weekly. The general model is the editorial playlist that works for any customer, while Reflex is a model trained on how your team specifically makes decisions, learning more every time an analyst corrects it. It isn’t pulling the nearest past case off a shelf. It’s a classifier that has taken your patterns on board well enough to judge alerts it has never seen before.

Walk an alert through it, and the flow is simple. The alert arrives, and Reflex — a trained language model — assigns the most likely verdict along with a calibrated confidence score. If that confidence is high, its verdict shapes the output. If it isn’t, the full LLM analysis runs, and the case goes to an analyst, just as it would if Reflex weren’t there at all. Whatever the analyst decides flows back in and retrains the model, so each correction is worth a little more than the last.
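The flow above can be sketched as a confidence gate. This is a simplified illustration under stated assumptions: the 0.9 threshold, the Routing type, the route names, and toy_classifier (a stand-in for the trained model) are all hypothetical, not Torq’s implementation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Routing:
    verdict: str       # "malicious" or "benign"
    confidence: float  # calibrated probability of the chosen verdict
    route: str         # "reflex_verdict" or "full_analysis"


def route_alert(
    alert: dict,
    classify: Callable[[dict], Tuple[str, float]],
    threshold: float = 0.9,  # assumed cut-off; a real deployment would tune this
) -> Routing:
    """Use the model's verdict when it is confident, otherwise defer."""
    verdict, confidence = classify(alert)
    route = "reflex_verdict" if confidence >= threshold else "full_analysis"
    return Routing(verdict, confidence, route)


def toy_classifier(alert: dict) -> Tuple[str, float]:
    """Hypothetical stand-in for the trained model: a fixed lookup by alert type."""
    scores = {
        "scheduled_pen_test": ("benign", 0.97),
        "odd_login": ("malicious", 0.62),
    }
    return scores.get(alert["type"], ("benign", 0.5))


high = route_alert({"type": "scheduled_pen_test"}, toy_classifier)
low = route_alert({"type": "odd_login"}, toy_classifier)
print(high.route, low.route)
```

In this sketch a low-confidence alert takes the same path it would take with no classifier in place, which is the property the text relies on. The retraining step on analyst feedback is left out.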

And because Reflex lives inside the Torq AI SOC Platform, a confident verdict doesn’t just label an alert and stop. It feeds Auto Triage and case creation, and then orchestrates the response from Socrates across Torq HyperAgents™. So the judgment your team teaches Reflex makes every downstream action across the full threat lifecycle more trustworthy.

That confidence number is the advantage of training a model instead of running a search. A lookup can tell you an alert resembles something a human fixed once. Reflex gives you a real, calibrated probability, which is what makes the whole arrangement safe to run. It weighs in when it’s sure and holds its tongue when it isn’t, so it can only help: anything it’s uncertain about goes right back to the pipeline you already trust.
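One common way to check whether a confidence score is calibrated is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket’s average confidence with its observed accuracy. The sketch below is a generic illustration with made-up inputs. The source does not say which calibration metric Reflex uses.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: ten verdicts at 0.9 confidence.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(well_calibrated, 3), round(overconfident, 3))
```

A score near zero means that verdicts issued at 90% confidence are right about 90% of the time. That is the property that makes a confidence threshold meaningful.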

Reflex is a purpose-built language-model design, engineered to learn from a small number of samples and to operate under asymmetric risk, where missing a malicious alert is far more dangerous than mislabeling a benign one. The same math lets you set your own tolerance for the two kinds of mistakes: by default, Reflex treats a missed threat as the costlier error, and you can dial that weighting to match how your organization actually thinks about risk. It beats typing “THIS IS VERY CRITICAL” into a prompt and hoping the model takes the hint.
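The asymmetric weighting can be illustrated with a standard cost-sensitive decision rule: call an alert malicious whenever that has a lower expected cost than calling it benign. The sketch assumes the model outputs a calibrated probability that the alert is malicious. The function names and the 5:1 default cost ratio are illustrative, and the source does not state whether Reflex applies the weighting at decision time, as here, or during training.

```python
def malicious_threshold(cost_missed_threat: float, cost_false_alarm: float) -> float:
    """Probability above which calling an alert malicious is the cheaper choice.

    With p = P(malicious):
      expected cost of calling it benign    = p * cost_missed_threat
      expected cost of calling it malicious = (1 - p) * cost_false_alarm
    The two are equal at p = cost_false_alarm / (cost_false_alarm + cost_missed_threat).
    """
    if cost_missed_threat <= 0 or cost_false_alarm <= 0:
        raise ValueError("costs must be positive")
    return cost_false_alarm / (cost_false_alarm + cost_missed_threat)


def decide(
    p_malicious: float,
    cost_missed_threat: float = 5.0,  # assumed default: a miss is 5x worse
    cost_false_alarm: float = 1.0,
) -> str:
    """Pick the verdict with the lower expected cost."""
    threshold = malicious_threshold(cost_missed_threat, cost_false_alarm)
    return "malicious" if p_malicious >= threshold else "benign"


# The same 30% probability flips verdict depending on the cost ratio.
print(decide(0.3))            # 5:1 costs, threshold 1/6
print(decide(0.3, 1.0, 1.0))  # equal costs, threshold 1/2
```

Raising the cost of a missed threat lowers the probability at which an alert is treated as malicious, so the trade-off becomes an explicit, auditable number rather than an instruction in a prompt.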

What the Data Shows

We ran Reflex against the semantic-similarity approach most “learning” platforms lean on, across those same customer environments.

The number we care about most is the performance on alerts where the AI was wrong and a person had to step in. Reflex matched the analyst’s corrected verdict 92% of the time and held within about three points of that across folds. The similarity approach managed 78%. On overall accuracy, the trained model came out roughly 11 points ahead, and on the call that carries the most weight — confirming a real threat — it gained around 22 points over the baseline.
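The headline metric, agreement with the analyst’s corrected verdict, and its spread across folds can be computed as in the sketch below. The labels are toy values chosen for illustration and are not the study’s data. The names agreement_rate and fold_summary are hypothetical helpers.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Fold = Tuple[Sequence[str], Sequence[str]]  # (model verdicts, analyst-corrected verdicts)


def agreement_rate(predicted: Sequence[str], corrected: Sequence[str]) -> float:
    """Fraction of alerts where the model matches the analyst's corrected verdict."""
    if len(predicted) != len(corrected) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    matches = sum(p == c for p, c in zip(predicted, corrected))
    return matches / len(predicted)


def fold_summary(folds: List[Fold]) -> Tuple[float, float]:
    """Return (mean agreement, max-minus-min spread) across folds."""
    rates = [agreement_rate(p, c) for p, c in folds]
    return mean(rates), max(rates) - min(rates)


# Toy folds, not real results.
toy_folds: List[Fold] = [
    (["malicious", "benign", "malicious", "benign"],
     ["malicious", "benign", "malicious", "malicious"]),
    (["benign", "benign", "malicious", "malicious"],
     ["benign", "benign", "malicious", "malicious"]),
]
avg, spread = fold_summary(toy_folds)
print(f"mean agreement {avg:.3f}, spread {spread:.3f}")
```

Restricting the evaluation to alerts that a person had to correct keeps the easy cases from inflating the score.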

A tight spread means the result isn’t a fluke, and Reflex gets there in roughly two weeks of normal analyst feedback. The early weeks are bumpier while the model finds its feet, then it sharpens with every retraining pass. Most AI triage systems perform exactly the same on day 100 as they did on day one. Reflex draws a curve you can put in front of your board.

What It Means for Your Team

Fewer repeat corrections: Your analysts stop re-teaching lessons that the system should already know. A correction made once carries forward to every alert like it.
Consistency across shifts: The same alert lands on the same verdict, whether it’s a senior analyst at 2pm or a newer one at 3am.
Real confidence scores: You get a calibrated probability instead of hedge language, so high-confidence cases move fast while the genuinely uncertain cases land in front of a person for judgment.
Error trade-offs you control: You decide how much worse a missed threat is than a false alarm, and the model enforces it.
A curve you can measure: The system improves in ways you can actually report on, quarter after quarter.
Built for oversight: The EU AI Act’s high-risk rules take effect in August 2026. Reflex’s confidence-based routing with a human fallback already aligns with the Act’s oversight and transparency requirements.

Your Data, Your Model

Your model is yours and no one else’s. We don’t pool training data across customers, nor do we share parameters between them. Your red team’s patterns and your policy calls never leave your tenant or quietly shape a competitor’s experience. The model works for you because you’re the one who trained it.

That same data discipline is what makes a genuine human-on-the-loop model possible, and that model is what this series has been building toward. Most SOCs run human-in-the-loop out of necessity: an analyst checks the AI on nearly every alert, because there’s no reliable way to know which verdicts to trust.

Reflex changes that calculation. Because it produces a calibrated confidence score grounded in your team’s own calls, it can act on the cases it’s sure about and send the uncertain ones to a person. Your analysts shift from reviewing everything to supervising the system and stepping in where their judgment actually counts. They stay on the loop, with an auditable record of what the model decided and why.

This is the final piece of our series, Context, Memory, and Learning in the AI SOC: context grounds the agents, memory gives them precedent, and learning teaches them your team’s judgment. Together, they’re the layer that makes an autonomous AI SOC something you can actually trust.
