GPT-Vision & GPT
Overview of OpenAI GPT models with a focus on image input capabilities
The GPT class is designed for text-to-text interactions, utilizing models from the Generative Pre-trained Transformer series. These models are adept at understanding and generating text based on the input provided.
Initialization:
Your OpenAI API key. If not provided, it will attempt to use the OPENAI_API_KEY environment variable.
Model name of one of OpenAI's supported models.
Methods:
-
GPT.prompt(query: str) -> str
Takes a text query and returns the model's text response.
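A minimal sketch of a text-to-text call. The import path and the constructor keyword names (api_key, model) are assumptions based on the initialization parameters described above; adjust them to match your installation.

```python
from models import GPT  # hypothetical import path

# If api_key is omitted, the class falls back to the OPENAI_API_KEY
# environment variable.
gpt = GPT(api_key="sk-...", model="gpt-4o")

# Send a text query and print the model's text response.
answer = gpt.prompt("Summarize the GPT series of models in one sentence.")
print(answer)
```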
The GPTVision class extends the capabilities of GPT to include handling image inputs, making it suitable for image-to-text tasks. It is one of the multimodal models that can process both text and images.
Initialization:
Your OpenAI API key. If not provided, it will attempt to use the OPENAI_API_KEY environment variable.
Model name of one of OpenAI's supported models.
Methods:
-
GPTVision.prompt(query: str) -> str
Sends the prompt to the model and returns the response.
-
GPTVision.prompt_with_image(image: Image.Image, query: str) -> str
Sends the prompt and image to the model and returns the response.
-
GPTVision.parse_text(image: Image.Image) -> str
Returns all text extracted from an image
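A minimal sketch of the image-to-text methods. The import path and constructor keyword names are assumptions; the method signatures follow the documentation above, and the image filename is only an example.

```python
from PIL import Image
from models import GPTVision  # hypothetical import path

# api_key is read from OPENAI_API_KEY if not passed explicitly.
vision = GPTVision(model="gpt-4o")

image = Image.open("receipt.png")

# Ask a question about the image.
description = vision.prompt_with_image(image, "What store issued this receipt?")
print(description)

# Extract all readable text from the image (OCR-style).
text = vision.parse_text(image)
print(text)
```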
When to use
GPT-4o and GPT-4V are the leading vision-language models and are capable of understanding relatively fine details in images. However, when scaling to datasets of millions of images, these models can become prohibitively expensive. Generally, GPT-4o is the best model to use for image-to-text tasks, and it is cheaper than GPT-4V.
Usage Scenarios
Limitations
Many GPT vision models have relatively strict guardrails around inputs containing photos of people, so your requests may get screened out if they relate to these topics. These models also generally struggle with spatial reasoning (e.g., is object A in front of object B?).
For more details, check the OpenAI docs.