How to Specialize 7B·70B Models on a Single GPU — LoRA·QLoRA·PEFT Principles and Practical Code

Have you ever given up on LLM fine-tuning because you didn't have an A100 cluster? I have. After learning that fully fine-tuning GPT-3 175B requires infrastructure costing millions of won, I stepped away for a while — but discovering LoRA completely changed my perspective. With a single RTX 4090, you can fine-tune a 7B model, and with QLoRA, even 70B-class models become accessible on consumer GPUs.

LoRA starts from a fairly simple but powerful idea: "You don't need to update the entire model — just training two small matrices is enough to recover most of the performance." Proposed by a Microsoft research team in 2021, this approach has become the de facto standard in the LLM fine-tuning ecosystem and is expanding into image generation and speech models as well.

This article covers everything in one place: why LoRA works from first principles, how to apply it with actual training code including SFTTrainer, and the pitfalls you'll commonly encounter in practice.

Core Concepts

What Is PEFT — The Big Picture LoRA Belongs To

Before diving in, let me clarify one term. PEFT (Parameter-Efficient Fine-Tuning) refers to the entire family of techniques that specialize a model using only a small number of parameters without updating all weights. Prefix Tuning, Adapter Layers, and Prompt Tuning all fall under this umbrella, but in practice today, LoRA and its derivative QLoRA are overwhelmingly dominant.

Why LoRA Works — The Low-Rank Hypothesis

The core of LoRA is the hypothesis that "the intrinsic rank of weight updates in LLMs is much lower than you'd think." In plain terms: you don't need to update an entire matrix of thousands × thousands — touching just a few key directions within it is enough.

Expressed mathematically:

W' = W₀ + ΔW = W₀ + B·A

W₀: Frozen pre-trained weights (d×d)
A: Rank-r matrix (r×d) — initialized with a Gaussian distribution
B: Rank-r matrix (d×r) — initialized to zero (no effect at the start of training)
r ≪ d: The smaller the rank, the fewer parameters to update

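Matrix size reference: Here, d is the model's hidden size. LLaMA-3.1 8B has d=4096, and 70B has d=8192. Computing B(d×r)·A(r×d) yields d×d, matching the shape of W₀. Not every layer is square: for a projection with d_in inputs and d_out outputs, A is r×d_in and B is d_out×r, so B·A still matches the shape of the weight it adapts. If matrix equations aren't your thing, you can skip this part without losing the big picture.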
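
To make the shapes and the zero initialization concrete, here is a minimal pure-Python sketch. The helper names (`matmul`, `lora_init`, `merged_weight`, `param_counts`) and the 0.02 standard deviation are my own illustrative choices, not part of PEFT or the paper; real implementations use tensors, but the arithmetic is the same.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_init(d, r, seed=0):
    """Return (A, B): A is r x d with Gaussian entries, B is d x r of zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B


def merged_weight(W0, A, B):
    """Compute W' = W0 + B.A for square W0."""
    delta = matmul(B, A)  # d x d, same shape as W0
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]


def param_counts(d, r):
    """Parameters in a full d x d update vs. in the two LoRA factors."""
    return d * d, 2 * d * r


# Tiny example so the matrices are easy to inspect.
d, r = 8, 2
rng = random.Random(1)
W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
A, B = lora_init(d, r)

# Because B starts at zero, the adapted weight equals W0 before any training step.
W_start = merged_weight(W0, A, B)
print("unchanged at step 0:", W_start == W0)

# Parameter count for one square projection at LLaMA-3.1 8B scale.
full, lora = param_counts(4096, 16)
print(f"full update: {full:,}  LoRA factors: {lora:,}  ratio: {full // lora}x")
```

For a single 4096×4096 projection at r=16, the two factors hold 131,072 parameters against 16,777,216 for a full update, which is where the savings come from.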

Compared to full fine-tuning of GPT-3 175B, trainable parameters are reduced by roughly 10,000× and GPU memory by about 3× (figures from the original paper for GPT-3). The paper reports quality on par with full fine-tuning on its benchmarks; in practice, results vary by task and dataset, and a gap can show up on harder tasks. That's barely even a tradeoff, at least when the conditions are right.

LoRA vs. Full Fine-Tuning: Full fine-tuning directly modifies the existing weights, whereas LoRA keeps the existing weights completely frozen and only trains the additional matrices. Once training is complete, the product B·A can be added into W₀, so a merged model has zero extra computation overhead at inference time.
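
The "zero overhead after merging" claim is easy to check numerically. The sketch below is a toy with random 6×6 matrices and hypothetical helper names; it compares the training-time path (frozen weight plus a separate low-rank branch) with the merged path (one matrix). In PEFT, the corresponding operation is `merge_and_unload()` on the adapted model.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


rng = random.Random(0)
d, r = 6, 2
W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen weight, d x d
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # r x d
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # d x r, non-zero as after training
x = [rng.gauss(0, 1) for _ in range(d)]

# Training-time path: W0.x plus the low-rank branch B.(A.x), two extra small matmuls.
h_adapter = [a + b for a, b in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Inference-time path: fold B.A into the weight once, then a single matmul per call.
BA = matmul(B, A)
W_merged = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]
h_merged = matvec(W_merged, x)

# The two paths agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(h_adapter, h_merged))
print("outputs match:", max_diff < 1e-9)
```

The merged weight has the same shape as W₀, so the deployed model is no larger or slower than the base model. The cost is that a merged adapter can no longer be swapped out.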

Understanding the Key Hyperparameters

When first applying LoRA, it can feel overwhelming to know what values to use. The table below serves as a starting point.

Parameter	Meaning	Recommended Starting Value
`r` (rank)	Dimension of the low-rank matrices. Lower = fewer parameters	Start with 4–16
`lora_alpha`	Scaling factor. `alpha/r` is the effective scale applied	Recommend setting to `2×r`
`target_modules`	Layers to apply LoRA to	Recommend `"all-linear"`
`lora_dropout`	Dropout for overfitting prevention	0.05–0.1
`learning_rate`	Learning rate. Set higher for LoRA than full fine-tuning	2e-4 ~ 1e-3

lora_alpha is an easy parameter to get confused about. The effective scale applied is alpha/r — with r=16, alpha=32, the scale is 2. Conventionally, using alpha = 2×r as a default and tuning from there is the safe approach.

LoRA Variants — The Ecosystem Post-2025

Starting from plain LoRA, there are now quite a few variants to choose from depending on your goal.

Method	Characteristics	When to Use
QLoRA	4-bit quantization (NF4) + LoRA combination	When memory is tight. For 7B on an RTX 4090, or for 70B on consumer GPUs
DoRA	Decomposes weights into magnitude and direction	Stable performance at small ranks, lower hyperparameter sensitivity
rsLoRA	Adjusts scale factor to `alpha/√r`	Stabilizes training at high ranks (r≥64)
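
The difference between the standard scale and the rsLoRA scale is easiest to see by holding alpha fixed and sweeping the rank. This is only the arithmetic from the table, wrapped in two hypothetical helper functions. In PEFT, rsLoRA is switched on with `use_rslora=True` in `LoraConfig`.

```python
import math


def lora_scale(alpha, r):
    """Standard LoRA: the update B.A is multiplied by alpha / r."""
    return alpha / r


def rslora_scale(alpha, r):
    """rsLoRA: the update is multiplied by alpha / sqrt(r) instead."""
    return alpha / math.sqrt(r)


# The article's example: r=16, alpha=32 gives an effective scale of 2.
print("r=16, alpha=32 ->", lora_scale(32, 16))

# With alpha fixed, the standard scale shrinks quickly as the rank grows.
alpha = 16
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  alpha/r={lora_scale(alpha, r):.4f}  "
          f"alpha/sqrt(r)={rslora_scale(alpha, r):.4f}")
```

This is why alpha is usually tied to r (for example alpha = 2×r) under the standard scale: leaving alpha fixed while raising r quietly shrinks the update.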

QLoRA tradeoffs: 4-bit quantization dramatically reduces memory, but can introduce an additional 5–15% performance loss compared to LoRA alone. If you have sufficient memory, using LoRA alone is the safer option.
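
To put rough numbers on "dramatically reduces memory", here is a back-of-the-envelope estimate of weight storage only. The bits-per-parameter constants assume the block sizes described in the QLoRA paper (64-value blocks with a 32-bit constant each, and second-level 8-bit constants in blocks of 256). The round parameter counts are assumptions, and activations, LoRA weights, optimizer state, and any layers kept in 16-bit are all left out.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes) for a given bit width."""
    return n_params * bits_per_param / 8 / 1e9


# NF4 with one 32-bit constant per 64 weights: 4 + 32/64 = 4.5 bits per parameter.
NF4_PLAIN = 4 + 32 / 64
# Double quantization: 8-bit constants, plus a 32-bit constant per 256 of those.
NF4_DOUBLE_QUANT = 4 + 8 / 64 + 32 / (64 * 256)

for name, n_params in (("8B", 8.0e9), ("70B", 70.0e9)):
    bf16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, NF4_PLAIN)
    nf4_dq = weight_memory_gb(n_params, NF4_DOUBLE_QUANT)
    print(f"{name}: bf16={bf16:.1f} GB  nf4={nf4:.1f} GB  nf4+double-quant={nf4_dq:.1f} GB")
```

Under these assumptions an 8B model needs about 4 GB for its 4-bit weights, while a 70B model needs about 36 GB, which is why 70B does not fit on a single 24 GB card even with QLoRA.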

Practical Application

Example 1: Fine-Tuning LLaMA-3.1 with HuggingFace PEFT

This is the most fundamental setup. Use this code as a template when first getting started.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Batching needs a pad token

# LoRA configuration
config = LoraConfig(
    r=16,                          # Starting value. Adjust based on validation loss
    lora_alpha=32,                 # Default of 2×r
    target_modules="all-linear",   # Target all linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196

# Load dataset and hold out 10% for validation.
# Per-epoch evaluation and load_best_model_at_end both need an eval set.
dataset = load_dataset("your-dataset", split="train").train_test_split(test_size=0.1, seed=42)

# Training configuration
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,            # Higher than full fine-tuning (~1e-5)
    bf16=True,                     # Match the bfloat16 weights loaded above
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",   # Named eval_strategy in newer transformers releases
    load_best_model_at_end=True,   # Use with early stopping
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    tokenizer=tokenizer,           # Named processing_class in newer TRL releases
)

trainer.train()
model.save_pretrained("./lora-output")  # Saves only the adapter. ~1% of base model size
```

The output confirms that only about 0.5% of total parameters are being trained. That's the core appeal of LoRA.
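
The 41,943,040 figure in that output can be reproduced by hand. The sketch below assumes the published LLaMA-3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, and a key/value width of 1024 from 8 KV heads of size 128) and assumes that `"all-linear"` adapts the seven linear projections in each block but not the output head. The dictionary and function names are hypothetical.

```python
# Assumed LLaMA-3.1 8B dimensions (taken from the model's published config).
LLAMA_3_1_8B = {"hidden": 4096, "intermediate": 14336, "kv_dim": 1024, "layers": 32}


def lora_params_per_layer(cfg, r):
    """LoRA parameters added to one transformer block: r * (d_in + d_out) per projection."""
    h, ff, kv = cfg["hidden"], cfg["intermediate"], cfg["kv_dim"]
    shapes = {
        "q_proj": (h, h),
        "k_proj": (h, kv),
        "v_proj": (h, kv),
        "o_proj": (h, h),
        "gate_proj": (h, ff),
        "up_proj": (h, ff),
        "down_proj": (ff, h),
    }
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


r = 16
trainable = lora_params_per_layer(LLAMA_3_1_8B, r) * LLAMA_3_1_8B["layers"]
total = 8_072_204_288  # "all params" from the printed output above
pct = 100 * trainable / total

print(f"trainable params: {trainable:,} || trainable%: {pct:.4f}")
```

The count is linear in r, so dropping to r=8 halves it. That makes this a quick way to sanity-check a new `r` or `target_modules` setting before loading the model.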

Code Element	Description
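`torch_dtype=torch.bfloat16`	Halves weight memory relative to fp32. Needs an Ampere-or-newer GPU (RTX 30/40 series, A100, H100)
`device_map="auto"`	Places the model on the available devices automatically, splitting layers across GPUs when there is more than one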
`target_modules="all-linear"`	Post-2025 benchmarks consistently show better results than applying only to q_proj+v_proj
`learning_rate=2e-4`	LoRA tends to work well with higher learning rates due to fewer updated parameters
`load_best_model_at_end=True`	Automatically restores the best checkpoint based on validation loss

Example 2: Breaking Through Memory Constraints with QLoRA

If your GPU memory is 24 GB or less, QLoRA is the practical choice. You can fine-tune a 7B model on a single RTX 4090, and with QLoRA, even 70B models become worth attempting on consumer GPUs (though 70B comes with significant constraints on batch size and sequence length).

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
import torch

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # Double quantization for additional memory savings
    bnb_4bit_quant_type="nf4",       # NormalFloat4, recommended by the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Choose 8B or 70B (70B is not possible on a single consumer GPU without QLoRA)
model_name = "meta-llama/Llama-3.1-8B"
# model_name = "meta-llama/Llama-3.1-70B"  # 70B: QLoRA required, 40GB+ VRAM recommended

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Batching needs a pad token

# Prepare 4-bit model for fine-tuning
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

# Hold out 10% for validation; evaluation and load_best_model_at_end need an eval set
dataset = load_dataset("your-dataset", split="train").train_test_split(test_size=0.1, seed=42)

training_args = TrainingArguments(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # Compensates for reduced batch size under memory constraints
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",    # Named eval_strategy in newer transformers releases
    load_best_model_at_end=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    tokenizer=tokenizer,            # Named processing_class in newer TRL releases
)

trainer.train()
model.save_pretrained("./qlora-output")
```

Code Element	Description
`bnb_4bit_use_double_quant`	Re-quantizes the quantization constants. Saves roughly 0.4 bits per parameter (0.37 in the QLoRA paper)
`nf4`	NormalFloat4. Data type optimized for normally distributed weights
`prepare_model_for_kbit_training`	Prepares the 4-bit model to correctly handle gradients
`gradient_accumulation_steps=8`	A way to maintain effective batch size while reducing per-step batch size
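
The LoRA and QLoRA configurations above use different per-device batch sizes but land on the same effective batch. The helper functions below are hypothetical, the 10,000-example dataset size is made up, and the step count is an approximation of what the trainer schedules, not its exact bookkeeping.

```python
import math


def effective_batch_size(per_device, accumulation_steps, num_gpus=1):
    """Examples that contribute to one optimizer update."""
    return per_device * accumulation_steps * num_gpus


def optimizer_steps(num_examples, epochs, per_device, accumulation_steps, num_gpus=1):
    """Approximate number of optimizer updates over the whole run."""
    batch = effective_batch_size(per_device, accumulation_steps, num_gpus)
    return math.ceil(num_examples / batch) * epochs


lora_batch = effective_batch_size(per_device=4, accumulation_steps=4)   # Example 1
qlora_batch = effective_batch_size(per_device=2, accumulation_steps=8)  # Example 2
print("effective batch:", lora_batch, qlora_batch)

# Hypothetical dataset of 10,000 training examples, 3 epochs.
print("optimizer steps:", optimizer_steps(10_000, 3, per_device=2, accumulation_steps=8))
```

Because the effective batch is unchanged, the learning rate from Example 1 carries over. Only the wall-clock time per update grows.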

Example 3: Boosting Training Speed 2–5× with Unsloth

When time is money, Unsloth is worth a serious look. It replaces parts of the training path with hand-written Triton kernels, and the project reports 2–5× faster training and up to 80% lower memory use on the same hardware. Treat those as the project's own figures and measure on your workload.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B",
    max_seq_length=2048,
    dtype=None,          # Auto-detect
    load_in_4bit=True,   # QLoRA mode
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    # Explicit list of the linear projections, as in Unsloth's own examples
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Additional 30% memory reduction
    random_state=42,
)

# Hold out 10% for validation; evaluation and load_best_model_at_end need an eval set
dataset = load_dataset("your-dataset", split="train").train_test_split(test_size=0.1, seed=42)

training_args = TrainingArguments(
    output_dir="./unsloth-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",   # Named eval_strategy in newer transformers releases
    load_best_model_at_end=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    tokenizer=tokenizer,           # Named processing_class in newer TRL releases
)

trainer.train()
model.save_pretrained("./unsloth-output")
```

Unsloth caveat: The open-source version only supports a single GPU. If you need a multi-GPU setup, you'll need the Pro version, or consider Axolotl as an alternative.

Pros and Cons Analysis

Honestly, when I first used LoRA, I was skeptical — "does this actually work?" Only after using it in practice did I realize there are clear situations where it shines and situations where it doesn't.

Advantages

Item	Details
Memory efficiency	3–20× reduction in GPU memory compared to full fine-tuning
Training speed	Only 0.1–1% of total parameters updated
Small-data stability	Fewer parameters means less overfitting
No inference overhead	Merging the adapter into the base model results in zero extra cost at inference
Multi-adapter operation	Swap task-specific adapters on a single base model. Example: run per-client or per-language adapters on top of one base model
Accessibility	Fine-tune 7B models on a consumer GPU (RTX 4090); 70B-class models possible with QLoRA

Disadvantages and Caveats

LoRA is not a silver bullet. Misunderstanding this can lead to painful surprises in production.

Item	Details	Mitigation
Performance gap	May be 5–10% behind full fine-tuning on tasks requiring peak performance	Try DoRA or `all-linear` + high rank
Hyperparameter sensitivity	Results vary significantly depending on `r`, `alpha`, and `target_modules` choices	Experiment sequentially: `r=4→8→16`
No architecture changes	Not suitable for adding new layers, changing embedding dimensions, or other fundamental modifications	Consider full fine-tuning in these cases
Adapter version management	Managing compatibility across many domain-specific adapters becomes complex	Use experiment tracking tools like W&B
QLoRA accuracy loss	4-bit quantization can cause an additional 5–15% performance loss compared to LoRA alone	Use LoRA alone if memory allows

The Most Common Mistakes in Practice

Setting r too high from the start — Starting with r=64 wastes time and memory. Starting at r=4 or r=8 and increasing while watching validation loss is far more efficient.
Skipping the baseline measurement — Without measuring the base model's performance before fine-tuning, you have no way of knowing how much you've actually improved. Honestly, I skipped this often early on and regretted it later.
Watching only training loss without a validation set — Even if training loss keeps decreasing, if validation loss starts rising, you're overfitting. It's recommended to configure early stopping alongside load_best_model_at_end=True.
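
The early-stopping rule from the last point fits in a few lines. This is a plain-Python sketch of the logic with made-up loss values and a hypothetical function name. With the Hugging Face trainer you would normally attach `EarlyStoppingCallback` (with its `early_stopping_patience` argument) instead of writing this yourself.

```python
def early_stopping(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than min_delta
    for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improving at first, then rising as the model overfits.
val_losses = [1.92, 1.61, 1.48, 1.51, 1.57, 1.66]
best, stopped = early_stopping(val_losses, patience=2)
print(f"best epoch: {best}, stopped after epoch: {stopped}")
```

Here the best checkpoint is epoch 2 and training stops after epoch 4, which is the checkpoint `load_best_model_at_end=True` would restore.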

Closing Thoughts

LoRA has changed the reality: you can now specialize an LLM for your service with just a single GPU, consumer hardware, and a small amount of domain data. The principle is simple, so the barrier to entry is low — but you have to actually run the code yourself to truly feel why this technique became the ecosystem standard.

Three steps you can take right now:

Run your first experiment for free on Google Colab: run `pip install peft transformers bitsandbytes trl datasets` and paste in the QLoRA code above. Fine-tuning a 7B model is possible even on the free-tier T4 (with constraints like batch size 1 and sequence length around 512). The T4 has no bfloat16 support, so switch `bf16=True` to `fp16=True` and set `bnb_4bit_compute_dtype=torch.float16`.
Verify your setup with model.print_trainable_parameters(): this single line, run before training starts, shows exactly what percentage of parameters are being trained. Try different combinations of r and target_modules and watch how the number moves.
Test for overfitting with a small dataset: first confirm that the model can overfit a small set of about 100 training examples. If it can, that's a signal your pipeline is wired up correctly.



