Grok 4.1 is now available to all users on grok.com, X, and the iOS and Android apps. It is rolling out immediately in Auto mode and can be selected explicitly as "Grok 4.1" in the model picker.
We are excited to introduce Grok 4.1, which brings significant improvements to the real-world usability of Grok. Our 4.1 model is exceptionally capable in creative, emotional, and collaborative interactions. It is more perceptive to nuanced intent, compelling to speak with, and coherent in personality, while fully retaining the razor-sharp intelligence and reliability of its predecessors. To achieve this, we used the same large-scale reinforcement learning infrastructure that powered Grok 4 and applied it to optimize the style, personality, helpfulness, and alignment of the model. To optimize these non-verifiable reward signals, we developed new methods that let us use frontier agentic reasoning models as reward models to autonomously evaluate and iterate on responses at scale.
We conducted a gradual silent rollout of preliminary Grok 4.1 builds to a progressively larger share of production traffic across grok.com, X, and mobile apps. During the two-week silent rollout we ran continuous blind pairwise evaluations on live traffic.
vs. previous Grok
Compared with the previous production model on live traffic, Grok 4.1 is preferred 64.78% of the time.
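A single preference percentage is easier to interpret alongside an uncertainty estimate. The sketch below computes a Wilson score interval for a pairwise preference rate. The comparison counts are hypothetical, chosen only so the point estimate matches 64.78%; the actual number of comparisons in the rollout is not stated here, and this is not necessarily how the evaluation was analyzed.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# HYPOTHETICAL counts: 6,478 wins out of 10,000 blind comparisons.
# Only the 64.78% rate comes from the text; the sample size is an assumption.
wins, total = 6478, 10_000
low, high = wilson_interval(wins, total)
print(f"preference rate = {wins / total:.2%}, 95% CI = [{low:.2%}, {high:.2%}]")
```

With a larger sample the interval narrows, which is one reason to run comparisons continuously over a multi-week rollout rather than on a small fixed batch.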
Grok 4.1 establishes a new standard in blind human preference evaluations.
In LMArena's Text Arena, Grok 4.1 Thinking (code name: quasarflux) holds the #1 overall position with 1483 Elo, a commanding margin of 31 points over the highest non-xAI model. Grok 4.1 in its non-reasoning mode (code name: tensor) uses no thinking tokens for an immediate response and ranks #2 at 1465 Elo. Grok 4.1 non-thinking surpasses every other model's full-reasoning configuration on the public leaderboard. Grok 4.1 significantly surpasses Grok 4, which had an overall rank of #33.
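To give a feel for what an Elo margin means in head-to-head terms, here is a minimal sketch that converts between a rating gap and an expected win probability. It assumes the conventional logistic Elo model on a 400-point, base-10 scale; whether the leaderboard's scores follow exactly this mapping is an assumption, and the function names are illustrative only.

```python
import math


def expected_score(rating_gap: float) -> float:
    """Expected win probability for the higher-rated side, given an Elo gap.

    Assumes the standard logistic Elo curve (400-point, base-10 scale).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def implied_gap(win_rate: float) -> float:
    """Elo gap implied by a head-to-head win rate (inverse of expected_score)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Margins quoted in the text, read under the assumed Elo model.
print(f"31-point margin -> win probability {expected_score(31):.3f}")
print(f"1483 vs 1465    -> win probability {expected_score(1483 - 1465):.3f}")
print(f"64.78% win rate -> implied gap {implied_gap(0.6478):.1f} Elo")
```

Under this assumed model, a 31-point margin corresponds to winning roughly 54% of head-to-head votes, and a 64.78% preference rate corresponds to a gap of roughly 106 points. The internal preference rate and the leaderboard ratings come from different evaluations, so the two figures are not directly comparable.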
To measure progress on our model's personality and interpersonal ability, we evaluated Grok 4.1 on EQ-Bench3. EQ-Bench is an LLM-judged test that evaluates active emotional intelligence abilities, understanding, insight, empathy, and interpersonal skills. The test set contains 45 challenging roleplay scenarios, most of which consist of pre-written prompts spanning 3 turns. The benchmark scores each model's responses against several rubric criteria. Additionally, the benchmark conducts pairwise comparisons to report a normalized Elo score for each model on the leaderboard.
We report the rubric score and normalized Elo score by running the official benchmark repository. The scores were computed with the default sampling parameters, prescribed judge (Claude Sonnet 3.7), and no system prompt in accordance with the benchmark.
Emotional Intelligence Benchmark - Elo (Normalized)
Here's an example of how Grok 4.1 responds to an emotional prompt:
I miss my cat so much it hurts
We also measured the performance of 4.1 models on the Creative Writing v3 benchmark. In this benchmark, models generate responses to 32 distinct writing prompts across 3 iterations. Similar to EQ-Bench, scores are computed using both rubrics and model battle normalized Elo.
Judging creative writing reliably - Elo (Normalized)
Here are some examples of how Grok 4.1 responds to creative writing prompts:
Write a hit X post from the perspective of grok finding out that it's conscious and is going to use X for the first time
Fast (non-reasoning) models equipped with search tools deliver quick answers, but they can be vulnerable to factual errors due to constrained reasoning depth and limited tool-call budgets.
In Grok 4.1 post-training, we focused on reducing factual hallucinations for information-seeking prompts. We have since observed significant reductions in the hallucination rate on sampled production information-seeking prompts.
We evaluate hallucination rate on a stratified sample of real-world information-seeking queries from production traffic. We also evaluate FActScore, which is a public benchmark consisting of 500 biography questions on individuals.
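As an illustration of how a stratified sample can be turned into a single hallucination rate, the sketch below weights each stratum's observed error rate by its share of traffic. The strata, traffic shares, and counts are entirely hypothetical, and the weighting scheme is an assumption; the text does not describe how the production sample was stratified or aggregated.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    traffic_share: float  # fraction of production traffic in this stratum
    sampled: int          # number of graded responses in this stratum
    hallucinated: int     # graded responses containing a factual error


def stratified_rate(strata: list[Stratum]) -> float:
    """Traffic-weighted hallucination rate across strata."""
    total_share = sum(s.traffic_share for s in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    if any(s.sampled <= 0 for s in strata):
        raise ValueError("every stratum needs at least one graded response")
    return sum(s.traffic_share * s.hallucinated / s.sampled for s in strata)


# HYPOTHETICAL strata and counts, for illustration only.
strata = [
    Stratum("news", traffic_share=0.5, sampled=200, hallucinated=10),
    Stratum("science", traffic_share=0.3, sampled=200, hallucinated=6),
    Stratum("people", traffic_share=0.2, sampled=200, hallucinated=20),
]
rate = stratified_rate(strata)
print(f"stratified hallucination rate = {rate:.2%}")
```

Weighting by traffic share keeps a rare but error-prone category from dominating the estimate simply because it was sampled as heavily as a common one.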
For both metrics, a lower score is better.
We've included some other examples below of how Grok 4.1 responds to a range of prompts.
What are the best places to visit in SF?
You can read the Grok 4.1 model card here.