Note: Extras are currently only supported on Linux.
Overeasy currently supports int4 quantization for running models like QwenVL. To use these models, you will need a performant install of AutoGPTQ, so make sure to build the relevant CUDA extensions! In our example Colab, we instead install a prebuilt AutoGPTQ wheel, which skips the compile and makes things a bit easier. You can install AutoGPTQ from pip following the instructions in its README (sketched below), or build it from source as described under Source Install.
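For reference, the prebuilt-wheel route looks roughly like this. The index URL and CUDA tag (cu118 here) follow AutoGPTQ's README at the time of writing and may change, so treat them as assumptions and double-check the upstream instructions:

!pip install auto-gptq==0.7.1 --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/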
Source Install
!pip install optimum tiktoken gekko einops transformers_stream_generator accelerate
!pip install git+https://github.com/AutoGPTQ/AutoGPTQ@v0.7.1
Note: Building from source can take around 20 minutes.
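Either way, a quick sanity check that the package and CUDA are visible is worthwhile. This is a minimal sketch; it uses only the standard library and PyTorch, since the internal layout of AutoGPTQ's compiled extensions varies between versions:

import importlib.metadata
import torch

# Confirm auto-gptq is installed and that PyTorch can see a CUDA device;
# without CUDA, the quantized kernels will be slow or unavailable.
print("auto-gptq version:", importlib.metadata.version("auto-gptq"))
print("CUDA available:", torch.cuda.is_available())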
After installing, you can use the int4-quantized model like this, provided you have more than 11 GB of VRAM:
from overeasy import QwenVL

# Load the int4-quantized QwenVL weights onto the GPU.
model = QwenVL("int4")
model.load_resources()

response = model.prompt("What is the capital of California?")
print(response)
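Because loading will fail or run out of memory on smaller cards, you may want to check available VRAM up front. A minimal sketch using PyTorch's device-properties API (the 11 GiB threshold mirrors the requirement above):

import torch

# total_memory is reported in bytes; convert to GiB for comparison.
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
if total_gib < 11:
    raise RuntimeError(f"QwenVL int4 needs over 11 GiB of VRAM; found {total_gib:.1f} GiB")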