Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.
Capabilities
Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. We are particularly excited about Grok’s capabilities in understanding our physical world. Grok outperforms its peers in our new RealWorldQA benchmark that measures real-world spatial understanding. For all datasets below, we evaluate Grok in a zero-shot setting without chain-of-thought prompting.
Benchmark | Grok-1.5V | GPT-4V | Claude 3 Sonnet | Claude 3 Opus | Gemini Pro 1.5 |
---|---|---|---|---|---|
MMMU (multi-discipline) | 53.6% | 56.8% | 53.1% | 59.4% | 58.5% |
MathVista (math) | 52.8% | 49.9% | 47.9% | 50.5% | 52.1% |
AI2D (diagrams)* | 88.3% | 78.2% | 88.7% | 88.1% | 80.3% |
TextVQA (text reading) | 78.1% | 78.0% | - | - | 73.5% |
ChartQA (charts) | 76.1% | 78.5% | 81.1% | 80.8% | 81.3% |
DocVQA (documents) | 85.6% | 88.4% | 89.5% | 89.3% | 86.5% |
RealWorldQA (real-world understanding) | 68.7% | 61.4% | 51.9% | 49.8% | 67.5% |

*For AI2D samples missing annotations such as A, B, C, etc., we render bounding boxes and corresponding letters at the relevant areas in the image.
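The zero-shot, no-chain-of-thought protocol described above can be sketched as a simple prompt template. This is an illustrative sketch only; the function name and the exact prompt wording are assumptions, not xAI's actual evaluation harness.

```python
def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Format a multiple-choice VQA question as a zero-shot prompt:
    no in-context examples and no step-by-step reasoning instruction.
    (Illustrative sketch; not xAI's actual harness.)"""
    letters = "ABCD"
    lines = [question]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    # Ask for the answer directly, with no chain-of-thought request.
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

prompt = build_zero_shot_prompt(
    "Which object is larger, the pizza cutter or the scissors?",
    [
        "The pizza cutter is larger.",
        "The scissors are larger.",
        "They are about the same size.",
    ],
)
print(prompt)
```

The model's single-letter reply is then compared directly against the reference answer, with no few-shot examples in the context.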
Writing code from a diagram
User
[Image: flowchart of a number-guessing game]
Grok
Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:
```python
import random

def guess_number():
    # Generate a random number between 1 and 10
    target = random.randint(1, 10)

    while True:
        guess = int(input("Guess a number between 1 and 10: "))

        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()
```
This code will run the guessing game as described in the flowchart. The user will be prompted to enter their guess until they guess the correct number.
Real-World Understanding
In order to develop useful real-world AI assistants, it is crucial to advance a model's understanding of the physical world. Towards this goal, we are introducing a new benchmark, RealWorldQA. This benchmark is designed to evaluate basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.
Which object is larger, the pizza cutter or the scissors? A. The pizza cutter is larger. B. The scissors are larger. C. They are about the same size.
Where can we go from the current lane? A. Turn left. B. Go straight. C. Turn left and go straight. D. Turn right.
Given this front camera view from our sedan, do we have enough space to drive around the gray car in front of us? A. Yes. B. No.
Given the picture, in which cardinal direction is the dinosaur facing? A. North. B. South. C. East. D. West.
The initial release of RealWorldQA consists of over 700 images, each paired with a question and an easily verifiable answer. The dataset consists of anonymized images taken from vehicles, in addition to other real-world images. We are excited to release RealWorldQA to the community, and we intend to expand it as our multimodal models improve. RealWorldQA is released under CC BY-ND 4.0. Click here (677MB) to download the dataset.
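Since each image comes with a question and an easily verifiable answer, scoring a model on a RealWorldQA-style dataset reduces to exact-match accuracy. The sketch below illustrates this; the record field names ("image", "answer") are assumptions for illustration, not the dataset's actual schema.

```python
def score_predictions(dataset: list[dict], predictions: dict[str, str]) -> float:
    """Compute exact-match accuracy on a RealWorldQA-style dataset,
    where each record holds an image id, a question, and a reference
    answer. Field names are illustrative, not the actual schema."""
    correct = sum(
        1
        for record in dataset
        if predictions.get(record["image"], "").strip().lower()
        == record["answer"].strip().lower()
    )
    return correct / len(dataset)

# Hypothetical records and model predictions for illustration.
dataset = [
    {"image": "img_001.jpg", "question": "Which cardinal direction is the dinosaur facing?", "answer": "B"},
    {"image": "img_002.jpg", "question": "Do we have enough space to drive around the gray car?", "answer": "A"},
]
predictions = {"img_001.jpg": "B", "img_002.jpg": "B"}
print(score_predictions(dataset, predictions))  # one of two correct -> 0.5
```

Because the answers are short multiple-choice letters, no fuzzy matching or human grading is needed to verify them.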
Into the Future
Advancing both our multimodal understanding and generation capabilities is an important step in building beneficial AGI that can understand the universe. In the coming months, we anticipate making significant improvements in both capabilities, across various modalities such as images, audio, and video.
If you want to be a part of this journey, we are hiring.