To run a ChatGPT fine-tuned model locally, export your training data from the OpenAI API, then use it to fine-tune a compatible open-source model like Mistral 7B or Llama 2 with a library such as Unsloth. OpenAI does not allow direct download of fine-tuned model weights, so this workaround retrains an open-source base model on your original training data to produce a functionally similar model you fully control. This guide covers the complete process from data export through local inference setup.
Understanding the Constraints
Before proceeding, it is important to understand what OpenAI actually provides. When you fine-tune a model through the OpenAI API, you do not receive the full model weights directly. Instead, you get an API endpoint that hosts your fine-tuned model on OpenAI’s infrastructure. This is different from some other platforms that allow direct weight downloads.
The good news is that OpenAI provides a workaround. You can export your fine-tuning training data and use it to train a compatible open-source model locally. While this will not give you an exact replica of OpenAI’s hosted model, it produces a functionally similar model that you can run on your own hardware.
Exporting Your Fine-Tuning Data
The first step is retrieving your training data from OpenAI. You need your fine-tuning job ID to access this information. You can find this ID in the OpenAI dashboard or through the API.
Here is how to retrieve your fine-tuning job details using the OpenAI Python library:
import openai
client = openai.OpenAI(api_key="your-api-key")
# Replace with your fine-tuning job ID
job_id = "ft-your-job-id"
# Retrieve the job details
job = client.fine_tuning.jobs.retrieve(job_id)
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
print(f"Model: {job.model}")
print(f"Training file: {job.training_file}")
Once you have the training file ID, download the actual training data:
# Retrieve the training file
training_file_id = job.training_file
training_file = client.files.retrieve(training_file_id)
# Download the file content
file_content = client.files.content(training_file_id)
# Save to a local file
with open("fine_tuning_data.jsonl", "wb") as f:
f.write(file_content.read())
print("Training data saved to fine_tuning_data.jsonl")
This JSONL file contains all the prompt-completion pairs you used to fine-tune your model. You will use this data to train a local model.
Choosing a Local Model
Several open-source models can serve as a base for fine-tuning with your data. The choice depends on your performance requirements and hardware capabilities:
- Llama 2 from Meta offers a good balance between performance and resource requirements
- Mistral provides excellent performance with relatively modest hardware needs
- Phi-2 from Microsoft is compact and runs well on consumer hardware
- GPT-NeoX gives you full control over the training process
For most use cases, Mistral 7B provides the best combination of capability and efficiency. It runs on a single GPU with 24GB of VRAM and produces quality outputs comparable to much larger models.
Fine-Tuning Locally
With your training data exported, you can now fine-tune a local model. The most straightforward approach uses the Unsloth library, which significantly speeds up the fine-tuning process and reduces memory requirements:
from unsloth import FastLanguageModel
import torch
# Set up the model with 4-bit quantization for efficiency
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="meta-llama/Llama-2-7b-hf",
max_seq_length=2048,
dtype=torch.float16,
load_in_4bit=True,
)
# Add LoRA adapters for efficient fine-tuning
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
)
Now prepare your training data in the correct format and start the fine-tuning process:
from unsloth import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=formatted_dataset,
dataset_text_field="text",
max_seq_length=2048,
training_arguments=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
save_strategy="epoch",
output_dir="outputs",
),
)
trainer.train()
The training time depends on your dataset size and hardware. A typical fine-tuning job on a single A100 GPU takes a few hours for most datasets.
Converting and Saving the Model
After fine-tuning completes, save your model in a format suitable for local inference:
# Save the model and tokenizer
model.save_pretrained("my-fine-tuned-model")
tokenizer.save_pretrained("my-fine-tuned-model")
# If you used LoRA, merge the adapters first
merged_model = model.merge_and_unload()
merged_model.save_pretrained("my-fine-tuned-model-merged")
Running Inference Locally
Now you can run inference without any API calls. Here is a simple example using the transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "my-fine-tuned-model-merged"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype=torch.float16,
)
def generate_response(prompt, max_tokens=256):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test your model
response = generate_response("Your prompt here")
print(response)
Optimizing for Production
For production deployments, consider these optimizations:
- Quantization: Use 4-bit or 8-bit quantization to reduce memory requirements
- vLLM: Deploy using vLLM for significantly faster inference throughput
- ONNX export: Convert to ONNX format for cross-platform compatibility
# Example using vLLM for fast inference
from vllm import LLM, SamplingParams
llm = LLM(
model="my-fine-tuned-model-merged",
quantization="awq",
tensor_parallel_size=1,
)
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256,
)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
Cost Comparison
Running a fine-tuned model locally involves upfront hardware costs but can save money at scale. A single A100 GPU can handle dozens of requests per minute, and once you have the hardware, each inference costs only electricity and maintenance. For high-volume applications, this approach often proves more economical than paying per-token API fees.
Summary
Exporting your fine-tuning data from OpenAI and training a local model gives you more control and potentially lower costs. The process involves retrieving your training data, selecting an appropriate base model, fine-tuning locally using libraries like Unsloth, and deploying for inference. While the resulting model will not be identical to OpenAI’s hosted version, it captures the knowledge from your training data and runs entirely on your infrastructure.
Related Reading
Built by theluckystrike — More at zovo.one