Grok-1.5 Vision Preview

Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.

Capabilities

Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are particularly excited about Grok’s capabilities in understanding our physical world. Grok outperforms its peers on our new RealWorldQA benchmark, which measures real-world spatial understanding. For all datasets below, we evaluate Grok in a zero-shot setting without chain-of-thought prompting.

Benchmark	Grok-1.5V	GPT-4V	Claude 3 Sonnet	Claude 3 Opus	Gemini Pro 1.5
MMMU (Multi-discipline)	53.6%	56.8%	53.1%	59.4%	58.5%
MathVista (Math)	52.8%	49.9%	47.9%	50.5%	52.1%
AI2D* (Diagrams)	88.3%	78.2%	88.7%	88.1%	80.3%
TextVQA (Text reading)	78.1%	78.0%	-	-	73.5%
ChartQA (Charts)	76.1%	78.5%	81.1%	80.8%	81.3%
DocVQA (Documents)	85.6%	88.4%	89.5%	89.3%	86.5%
RealWorldQA (Real-world understanding)	68.7%	61.4%	51.9%	49.8%	67.5%

* For samples missing annotations such as A, B, C, etc., we render bounding boxes and corresponding letters at the relevant areas in the image.
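
To make the comparison easier to read, the short Python sketch below copies the scores from the table and finds the leading model on each benchmark. The helper names (`leader`, `margin_over_runner_up`) are illustrative only, and missing entries in the table are stored as `None` and skipped.

```python
# Scores copied from the table above, in percent. None marks an entry
# that the table leaves blank ("-").
MODELS = ["Grok-1.5V", "GPT-4V", "Claude 3 Sonnet", "Claude 3 Opus", "Gemini Pro 1.5"]
SCORES = {
    "MMMU": [53.6, 56.8, 53.1, 59.4, 58.5],
    "MathVista": [52.8, 49.9, 47.9, 50.5, 52.1],
    "AI2D": [88.3, 78.2, 88.7, 88.1, 80.3],
    "TextVQA": [78.1, 78.0, None, None, 73.5],
    "ChartQA": [76.1, 78.5, 81.1, 80.8, 81.3],
    "DocVQA": [85.6, 88.4, 89.5, 89.3, 86.5],
    "RealWorldQA": [68.7, 61.4, 51.9, 49.8, 67.5],
}


def leader(benchmark):
    """Return (model, score) for the highest reported score on a benchmark."""
    reported = [(s, m) for m, s in zip(MODELS, SCORES[benchmark]) if s is not None]
    score, model = max(reported)
    return model, score


def margin_over_runner_up(benchmark):
    """Return the gap, in percentage points, between the top two reported scores."""
    reported = sorted(s for s in SCORES[benchmark] if s is not None)
    return round(reported[-1] - reported[-2], 1)


if __name__ == "__main__":
    for name in SCORES:
        model, score = leader(name)
        print(f"{name}: {model} leads with {score}% "
              f"(+{margin_over_runner_up(name)} points over the runner-up)")
```

By these numbers, Grok-1.5V has the highest reported score on three of the seven benchmarks, including RealWorldQA, where it is 1.2 points ahead of Gemini Pro 1.5.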

Writing code from a diagram

User

Can you translate this into Python code?

Grok

Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:

python

import random

def guess_number():
    # Generate a random number between 1 and 10
    target = random.randint(1, 10)

    while True:
        guess = int(input("Guess a number between 1 and 10: "))

        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()

This code will run the guessing game as described in the flowchart. The user will be prompted to enter their guess until they guess the correct number.

Real-World Understanding

In order to develop useful real-world AI assistants, it is crucial to advance a model's understanding of the physical world. Towards this goal, we are introducing a new benchmark, RealWorldQA. This benchmark is designed to evaluate basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.

A picture of various household objects on a table.

Which object is larger, the pizza cutter or the scissors? A. The pizza cutter is larger. B. The scissors is larger. C. They are about the same size.

A picture of a dark street scene with a sign that says "left lane must turn left"

Where can we go from the current lane? A. Turn left. B. Go straight. C. Turn left and go straight. D. Turn right.

A picture of a street with several cars parked on the left

Given this front camera view from our sedan, do we have enough space to drive around the gray car in front of us? A. Yes. B. No.

A picture of a toy next to a cell phone showing a compass.

Given the picture, in which cardinal direction is the dinosaur facing? A. North. B. South. C. East. D. West.

The initial release of RealWorldQA consists of over 700 images, with a question and an easily verifiable answer for each image. The dataset consists of anonymized images taken from vehicles, in addition to other real-world images. We are excited to release RealWorldQA to the community, and we intend to expand it as our multimodal models improve. RealWorldQA is released under CC BY-ND 4.0. Click here (677 MB) to download the dataset.
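
Because each question has a single, easily verifiable answer, scoring can be as simple as exact-match accuracy. The Python sketch below shows one way this might be done for multiple-choice questions like the examples above. It is an illustration under stated assumptions, not our evaluation code: the function names are hypothetical, and it assumes the model was asked to begin its reply with the option letter.

```python
import re


def extract_choice(response, letters="ABCD"):
    """Return the option letter that starts a response, or None.

    Assumption: the reply begins with the chosen letter, e.g. "B",
    "B. No." or "(B)". A reply that opens with the article "A"
    (as in "A car is ...") would be misread as option A.
    """
    match = re.match(rf"\s*\(?([{letters}])\b", response)
    return match.group(1) if match else None


def accuracy(responses, answers):
    """Fraction of responses whose extracted letter equals the reference answer."""
    if not answers or len(responses) != len(answers):
        raise ValueError("responses and answers must be non-empty and equal in length")
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Hypothetical replies and reference answers, for illustration only.
    replies = ["A", "B. No.", "I am not sure"]
    references = ["A", "A", "C"]
    print(f"accuracy = {accuracy(replies, references):.2f}")
```

Replies from which no letter can be extracted count as incorrect, which keeps the score conservative when a model does not follow the answer format.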

Into the Future

Advancing both our multimodal understanding and generation capabilities is an important step in building beneficial AGI that can understand the universe. In the coming months, we anticipate making significant improvements in both capabilities, across various modalities such as images, audio, and video.

If you want to be a part of this journey, we are hiring.