Deploying YOLOv8 models on edge devices requires careful model optimization to balance accuracy against resource constraints. This post provides comprehensive implementation guidance for training YOLOv8 models, applying post-training quantization (PTQ) and quantization-aware training (QAT), exporting to multiple inference formats, validating accuracy preservation, and benchmarking performance gains. Practical examples demonstrate roughly 4x model compression and 1.5-2.75x inference speedup while keeping accuracy degradation below 2% mAP.
Part 1 established foundational concepts for edge CNN deployment. This post focuses on practical implementation: setting up YOLOv8 training environments, understanding dataset requirements, implementing INT8 quantization workflows, comparing PTQ versus QAT approaches, exporting models to ONNX and TensorRT formats, establishing validation methodologies, and benchmarking quantized model performance.
YOLOv8 Training Environment Setup
Before implementing quantization workflows, establish proper training infrastructure with required dependencies and validated dataset preparation.
Python Environment Configuration:
# Python 3.8+ environment setup
pip install ultralytics torch torchvision onnx onnxruntime
pip install tensorrt # NVIDIA TensorRT for deployment targets
# Verify installations
python -c "import ultralytics; print(ultralytics.__version__)"
python -c "import torch; print(torch.__version__)"
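If TensorRT deployment is planned, also confirm that TensorRT imports cleanly and that PyTorch sees a CUDA device (quick sanity checks, assuming an NVIDIA GPU is present):
python -c "import tensorrt; print(tensorrt.__version__)"
python -c "import torch; print(torch.cuda.is_available())"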
Dataset Preparation Requirements: YOLOv8 expects datasets in a specific format with a defined directory structure. For custom datasets, organize files as follows:
dataset/
├── images/
│ ├── train/
│ │ ├── image1.jpg
│ │ └── image2.jpg
│ └── val/
│ ├── image3.jpg
│ └── image4.jpg
├── labels/
│ ├── train/
│ │ ├── image1.txt
│ │ └── image2.txt
│ └── val/
│ ├── image3.txt
│ └── image4.txt
└── dataset.yaml
Dataset configuration file (dataset.yaml) defines paths and classes:
# dataset.yaml
path: /path/to/dataset
train: images/train
val: images/val
names:
0: person
1: vehicle
2: bicycle
Label files use YOLO format with one line per object: class_id x_center y_center width height (normalized 0-1 coordinates).
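For example, a label file describing one person and one vehicle (illustrative coordinates) looks like:
0 0.481 0.633 0.124 0.210
1 0.712 0.458 0.305 0.270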
YOLOv8 Training Implementation
Training YOLOv8 models establishes baseline accuracy before applying quantization. The training configuration significantly impacts how quantization-friendly the resulting model is.
Basic Training Script (Python):
from ultralytics import YOLO
# Load pretrained model or create new
model = YOLO('yolov8n.pt') # nano model for edge deployment
# Training configuration
results = model.train(
data='dataset.yaml',
epochs=100,
imgsz=640,
batch=16,
device=0, # GPU device
optimizer='AdamW',
lr0=0.001,
weight_decay=0.0005,
augment=True,
mosaic=1.0,
mixup=0.0,
patience=50,
save=True,
project='runs/train',
name='yolov8n_baseline'
)
# Validation
metrics = model.val()
print(f"Baseline mAP50: {metrics.box.map50:.4f}")
print(f"Baseline mAP50-95: {metrics.box.map:.4f}")
Training Configuration Considerations: Image size impacts quantization effectiveness. The standard 640×640 resolution provides a good balance between accuracy and edge performance; smaller sizes (320×320 or 416×416) further reduce compute requirements but sacrifice accuracy. Data augmentation improves model robustness but increases training time by 20-30%. Mosaic augmentation at 1.0 strength is recommended for diverse datasets. The patience value controls early stopping; setting it to 50 epochs allows adequate convergence verification.
Training produces the baseline model used for quantization. A typical YOLOv8n baseline achieves 37-40 mAP50-95 on the COCO dataset and 45-48 mAP50-95 on custom datasets with fewer classes and controlled environments. Training time ranges from 2-4 hours on a single RTX 3090 for 100 epochs with standard datasets.
Post-Training Quantization (PTQ) Implementation
PTQ converts trained FP32 models to INT8 representation without retraining. This approach offers the fastest quantization workflow and suits most edge deployments.
flowchart TD
A[Trained FP32 Model] --> B[Export to ONNX]
B --> C[Load Calibration Dataset]
C --> D[Run Calibration Pass]
D --> E[Compute Quantization Parameters]
E --> F[Apply INT8 Conversion]
F --> G[Validate Accuracy]
G --> H{Accuracy Acceptable?}
H -->|Yes| I[Deploy INT8 Model]
H -->|No| J[Adjust Calibration]
J --> D
H -->|Still Poor| K[Consider QAT]
PTQ with TensorRT (Python):
from ultralytics import YOLO
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import cv2
import numpy as np
# Load trained model
model = YOLO('runs/train/yolov8n_baseline/weights/best.pt')
# Export to ONNX first
model.export(format='onnx', imgsz=640, opset=12)
# TensorRT INT8 calibration
class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
def __init__(self, calibration_dataset, batch_size=8):
trt.IInt8EntropyCalibrator2.__init__(self)
self.dataset = calibration_dataset
self.batch_size = batch_size
self.current_index = 0
# Allocate device memory
self.device_input = cuda.mem_alloc(
batch_size * 3 * 640 * 640 * np.dtype(np.float32).itemsize
)
def get_batch_size(self):
return self.batch_size
def get_batch(self, names):
if self.current_index + self.batch_size > len(self.dataset):
return None
batch = []
for i in range(self.batch_size):
img = self.dataset[self.current_index + i]
batch.append(img)
self.current_index += self.batch_size
batch_array = np.concatenate(batch, axis=0)
cuda.memcpy_htod(self.device_input, batch_array)
return [int(self.device_input)]
def read_calibration_cache(self):
return None
def write_calibration_cache(self, cache):
with open('calibration.cache', 'wb') as f:
f.write(cache)
# Load calibration dataset (500-1000 representative images)
# calibration_image_paths: a list of file paths to representative deployment images
calibration_images = []
for img_path in calibration_image_paths:
img = cv2.imread(img_path)
img = cv2.resize(img, (640, 640))
img = img.transpose(2, 0, 1).astype(np.float32) / 255.0
img = np.expand_dims(img, axis=0)
calibration_images.append(img)
calibrator = EntropyCalibrator(calibration_images, batch_size=8)
# Create TensorRT builder with INT8 precision
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
# Parse ONNX model, surfacing parser errors rather than failing silently
with open('runs/train/yolov8n_baseline/weights/best.onnx', 'rb') as model_file:
    if not parser.parse(model_file.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('Failed to parse ONNX model')
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = calibrator
config.max_workspace_size = 1 << 30  # 1GB (newer TensorRT versions use config.set_memory_pool_limit instead)
# Build INT8 engine (build_engine is deprecated in recent TensorRT in favor of build_serialized_network)
engine = builder.build_engine(network, config)
# Save engine
with open('yolov8n_int8.engine', 'wb') as f:
f.write(engine.serialize())
print("INT8 TensorRT engine created successfully")
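As a quick sanity check, the serialized engine can be deserialized and its input/output bindings listed (a minimal sketch using the binding API of TensorRT 8.x; newer releases expose get_tensor_name/get_tensor_shape instead):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open('yolov8n_int8.engine', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
# List binding names, shapes, and dtypes to confirm the engine built as expected
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i), engine.get_binding_shape(i), engine.get_binding_dtype(i))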
PTQ with ONNX Runtime (Python):
from onnxruntime.quantization import quantize_static, QuantFormat, QuantType
from onnxruntime.quantization import CalibrationDataReader
import onnx
class YOLOCalibrationDataReader(CalibrationDataReader):
def __init__(self, calibration_dataset):
self.dataset = calibration_dataset
self.iterator = iter(self.dataset)
def get_next(self):
try:
batch = next(self.iterator)
return {"images": batch}
except StopIteration:
return None
# Static quantization (INT8)
model_input = 'runs/train/yolov8n_baseline/weights/best.onnx'
model_output = 'yolov8n_int8_onnx.onnx'
calibration_reader = YOLOCalibrationDataReader(calibration_images)
quantize_static(
    model_input,
    model_output,
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ,  # QDQ format inserts QuantizeLinear/DequantizeLinear node pairs
    per_channel=True,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8
)
print(f"Quantized model saved to {model_output}")
# Verify model
quantized_model = onnx.load(model_output)
onnx.checker.check_model(quantized_model)
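Beyond the checker pass, a single dummy inference with ONNX Runtime confirms the quantized graph actually executes (a minimal sketch; the input name 'images' matches the default Ultralytics ONNX export):
import numpy as np
import onnxruntime as ort

# Run the quantized model once on random input and print the output shapes
session = ort.InferenceSession('yolov8n_int8_onnx.onnx', providers=['CPUExecutionProvider'])
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = session.run(None, {'images': dummy})
print([o.shape for o in outputs])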
PTQ Calibration Dataset Selection: Calibration data significantly impacts quantization quality. Use 500-1000 images representative of deployment scenarios, covering diverse lighting conditions, object scales, and backgrounds. Include challenging cases (occlusions, motion blur, low contrast) to ensure robust quantization parameters. Avoid using validation set for calibration to maintain unbiased accuracy evaluation.
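One simple way to assemble such a set, and to populate the calibration_image_paths list used earlier, is to sample from the training split rather than the validation split (a hedged sketch, assuming the dataset layout shown above):
import random
from pathlib import Path

# Sample up to 500 calibration images from the training split (never the validation split)
train_images = sorted(Path('dataset/images/train').glob('*.jpg'))
random.seed(0)  # reproducible calibration set
calibration_image_paths = [str(p) for p in random.sample(train_images, k=min(500, len(train_images)))]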
Quantization-Aware Training (QAT) Implementation
QAT fine-tunes models with simulated quantization effects, recovering accuracy lost during PTQ. Implement QAT when PTQ accuracy degradation exceeds 2% mAP or deployment requires maximum accuracy.
QAT Training Script (Python with PyTorch):
import torch
from ultralytics import YOLO
import torch.quantization

# Load pretrained FP32 model
model = YOLO('runs/train/yolov8n_baseline/weights/best.pt')

# Prepare model for QAT: assign a QAT qconfig, then insert fake-quantization observers.
# prepare_qat requires the module to be in training mode and to carry a qconfig.
# Note: the Ultralytics trainer does not consume this prepared module directly, so treat
# this block as an illustrative PyTorch QAT skeleton around the standard fine-tuning run.
model.model.train()
model.model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
qat_model = torch.quantization.prepare_qat(model.model, inplace=False)

# QAT fine-tuning configuration
qat_results = model.train(
    data='dataset.yaml',
    epochs=20,  # 10-20 epochs for QAT fine-tuning
    imgsz=640,
    batch=16,
    device=0,
    optimizer='AdamW',
    lr0=0.0001,  # Reduced learning rate for fine-tuning
    weight_decay=0.0005,
    augment=True,
    patience=10,
    save=True,
    project='runs/qat',
    name='yolov8n_qat',
    pretrained=True
)

# Convert fake-quantized modules into true INT8 modules
qat_model.eval()
torch.quantization.convert(qat_model, inplace=True)

# Export the fine-tuned model via the Ultralytics wrapper
model.export(format='onnx', imgsz=640)
print("QAT completed successfully")
QAT-Style Workflow with Ultralytics: Ultralytics does not expose a dedicated QAT API, but recent versions streamline INT8 export with automatic calibration; pairing that export with a short fine-tuning pass (frozen backbone, reduced learning rate) approximates the QAT benefit:
from ultralytics import YOLO
# Load baseline model
model = YOLO('runs/train/yolov8n_baseline/weights/best.pt')
# Export with INT8 quantization (applies PTQ automatically)
model.export(
format='engine', # TensorRT
int8=True,
data='dataset.yaml', # Uses dataset for calibration
batch=8,
workspace=4 # 4GB workspace
)
# For QAT-style fine-tuning, continue training with a frozen backbone
model.train(
data='dataset.yaml',
epochs=10,
freeze=10, # Freeze first 10 layers
lr0=0.00001,
batch=8
)
QAT Training Considerations: Use reduced learning rate (10x smaller than initial training) to avoid destabilizing quantization-aware weights. Fine-tune for 10-20 epochs, monitoring validation mAP convergence. Freeze early layers (typically first 10 layers) to maintain learned features while adapting later layers to quantization constraints. QAT adds 20-30% to training time compared to standard training but typically recovers 0.5-1.5% mAP compared to PTQ.
Model Export Formats and Considerations
Different edge platforms require specific model formats. Understanding export options ensures optimal deployment compatibility.
ONNX Export (Universal Format):
from ultralytics import YOLO
model = YOLO('runs/train/yolov8n_baseline/weights/best.pt')
# Export to ONNX
model.export(
format='onnx',
imgsz=640,
opset=12, # ONNX opset version
simplify=True, # Simplify ONNX graph
dynamic=False # Static shapes for optimization
)
# Verify ONNX model
import onnx
onnx_model = onnx.load('runs/train/yolov8n_baseline/weights/best.onnx')
onnx.checker.check_model(onnx_model)
print("ONNX model validated successfully")
ONNX provides broad compatibility across platforms including ONNX Runtime, TensorRT, OpenVINO, and TensorFlow Lite converters. Use static shapes for maximum optimization. Dynamic shapes reduce optimization opportunities and increase runtime overhead.
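To confirm the export actually produced static input shapes, the ONNX graph inputs can be inspected directly (a small sketch; a dynamic export would show symbolic dimension names instead of integers):
import onnx

# Print each graph input name and its dimensions, e.g. images [1, 3, 640, 640]
onnx_model = onnx.load('runs/train/yolov8n_baseline/weights/best.onnx')
for inp in onnx_model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)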
TensorRT Export (NVIDIA Platforms):
from ultralytics import YOLO
model = YOLO('runs/train/yolov8n_baseline/weights/best.pt')
# Direct TensorRT export with INT8
model.export(
format='engine',
imgsz=640,
int8=True,
data='dataset.yaml', # For INT8 calibration
workspace=4, # 4GB GPU memory workspace
device=0
)
print("TensorRT engine created: best.engine")
TensorRT engines are platform-specific (compiled for specific GPU architecture). Generate engines on target hardware or use matching CUDA compute capability. TensorRT provides best performance on NVIDIA platforms with 2-5x speedup over generic frameworks.
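Before reusing an engine built on another machine, compare CUDA compute capabilities; an engine built for a different architecture will fail to deserialize. A quick check via PyTorch:
import torch

# Compute capability of the local GPU, e.g. (8, 7) on Jetson Orin or (5, 3) on Jetson Nano
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))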
TensorFlow Lite Export (Mobile/Embedded):
from ultralytics import YOLO
model = YOLO('runs/train/yolov8n_baseline/weights/best.pt')
# Export to TFLite with INT8 quantization
model.export(
format='tflite',
imgsz=640,
int8=True,
data='dataset.yaml'
)
print("TFLite model created: best_int8.tflite")
TFLite is suitable for Raspberry Pi, Android devices, and Coral TPU deployment. The Coral TPU requires an additional compilation step with the Edge TPU Compiler after TFLite generation.
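The Coral compilation step looks roughly like the following (assuming the edgetpu_compiler tool from Google's Coral packages is installed):
edgetpu_compiler best_int8.tflite
# Produces best_int8_edgetpu.tflite with supported ops mapped to the Edge TPU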
Validation Methodology for Quantized Models
Rigorous validation ensures quantized models meet accuracy requirements before deployment. Establish baseline metrics, measure quantization impact, and validate across diverse conditions.
Comprehensive Validation Script (Python):
from ultralytics import YOLO
import torch
import numpy as np
def validate_model(model_path, data_yaml, model_type='pt'):
"""Comprehensive model validation"""
if model_type == 'pt':
model = YOLO(model_path)
elif model_type == 'onnx':
model = YOLO(model_path, task='detect')
elif model_type == 'engine':
model = YOLO(model_path, task='detect')
# Standard validation metrics
results = model.val(data=data_yaml, imgsz=640, batch=16)
metrics = {
'mAP50': results.box.map50,
'mAP50-95': results.box.map,
'precision': results.box.p.mean(),
'recall': results.box.r.mean()
}
print(f"\n{model_type.upper()} Model Validation:")
print(f"mAP50: {metrics['mAP50']:.4f}")
print(f"mAP50-95: {metrics['mAP50-95']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
return metrics
# Validate baseline FP32 model
baseline_metrics = validate_model(
'runs/train/yolov8n_baseline/weights/best.pt',
'dataset.yaml',
'pt'
)
# Validate INT8 quantized model
quantized_metrics = validate_model(
'yolov8n_int8.engine',
'dataset.yaml',
'engine'
)
# Calculate accuracy degradation
mAP_degradation = (baseline_metrics['mAP50-95'] -
quantized_metrics['mAP50-95'])
degradation_percent = (mAP_degradation / baseline_metrics['mAP50-95']) * 100
print(f"\nQuantization Impact:")
print(f"mAP degradation: {mAP_degradation:.4f} ({degradation_percent:.2f}%)")
# Validation passes if degradation < 2%
if degradation_percent < 2.0:
print("✓ Quantization validation PASSED")
else:
print("✗ Quantization validation FAILED - Consider QAT")
Per-Class Accuracy Analysis: Validate quantization impact across individual classes to identify problematic categories:
def per_class_validation(model_path, data_yaml):
"""Per-class accuracy analysis"""
model = YOLO(model_path)
results = model.val(data=data_yaml)
# Extract per-class metrics
class_metrics = {}
for idx, class_name in enumerate(results.names.values()):
class_metrics[class_name] = {
'AP50': results.box.ap50[idx],
'AP50-95': results.box.ap[idx]
}
print("\nPer-Class Accuracy:")
for class_name, metrics in class_metrics.items():
print(f"{class_name}: AP50={metrics['AP50']:.4f}, "
f"AP50-95={metrics['AP50-95']:.4f}")
return class_metrics
# Compare baseline vs quantized per-class
baseline_classes = per_class_validation(
'runs/train/yolov8n_baseline/weights/best.pt',
'dataset.yaml'
)
quantized_classes = per_class_validation(
'yolov8n_int8.engine',
'dataset.yaml'
)
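With both dictionaries in hand, per-class degradation can be compared directly to flag categories that suffer disproportionately under quantization (a short sketch building on the two results above; the 0.02 threshold is an assumption to tune per project):
# Flag classes whose AP50-95 drops by more than ~2 points after quantization
for class_name in baseline_classes:
    drop = baseline_classes[class_name]['AP50-95'] - quantized_classes[class_name]['AP50-95']
    flag = '  <-- review calibration coverage for this class' if drop > 0.02 else ''
    print(f"{class_name}: AP50-95 drop = {drop:.4f}{flag}")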
Performance Benchmarking
Quantify performance improvements from quantization through systematic benchmarking covering inference latency, throughput, model size, and memory usage.
Comprehensive Benchmark Script (Python):
import time
import torch
import numpy as np
from ultralytics import YOLO
import os
def benchmark_model(model_path, num_iterations=100, warmup=10):
"""Comprehensive model benchmarking"""
model = YOLO(model_path)
# Prepare dummy input
dummy_input = torch.randn(1, 3, 640, 640).cuda()
    # Warmup runs (synchronize afterwards so warmup kernels do not leak into the timing)
    for _ in range(warmup):
        model.predict(dummy_input, verbose=False)
    torch.cuda.synchronize()
# Benchmark inference
latencies = []
for _ in range(num_iterations):
start = time.perf_counter()
model.predict(dummy_input, verbose=False)
torch.cuda.synchronize()
latencies.append((time.perf_counter() - start) * 1000) # ms
# Calculate statistics
latencies = np.array(latencies)
metrics = {
'mean_latency': np.mean(latencies),
'std_latency': np.std(latencies),
'p50_latency': np.percentile(latencies, 50),
'p95_latency': np.percentile(latencies, 95),
'p99_latency': np.percentile(latencies, 99),
'throughput': 1000 / np.mean(latencies), # FPS
'model_size': os.path.getsize(model_path) / (1024**2) # MB
}
return metrics
# Benchmark baseline FP32
print("Benchmarking FP32 baseline...")
fp32_metrics = benchmark_model('runs/train/yolov8n_baseline/weights/best.pt')
# Benchmark INT8 quantized
print("Benchmarking INT8 quantized...")
int8_metrics = benchmark_model('yolov8n_int8.engine')
# Calculate improvements
speedup = fp32_metrics['mean_latency'] / int8_metrics['mean_latency']
size_reduction = fp32_metrics['model_size'] / int8_metrics['model_size']
print("\nPerformance Comparison:")
print(f"FP32 Latency: {fp32_metrics['mean_latency']:.2f}ms "
f"(±{fp32_metrics['std_latency']:.2f}ms)")
print(f"INT8 Latency: {int8_metrics['mean_latency']:.2f}ms "
f"(±{int8_metrics['std_latency']:.2f}ms)")
print(f"Speedup: {speedup:.2f}x")
print(f"\nFP32 Model Size: {fp32_metrics['model_size']:.2f}MB")
print(f"INT8 Model Size: {int8_metrics['model_size']:.2f}MB")
print(f"Size Reduction: {size_reduction:.2f}x")
print(f"\nFP32 Throughput: {fp32_metrics['throughput']:.1f} FPS")
print(f"INT8 Throughput: {int8_metrics['throughput']:.1f} FPS")
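The script above covers latency, throughput, and file size; peak GPU memory can be approximated with PyTorch's allocator statistics (a rough sketch that only captures memory allocated through PyTorch, not TensorRT's internal workspace):
import torch
from ultralytics import YOLO

def peak_gpu_memory_mb(model_path, iterations=20):
    """Rough peak GPU memory (MB) seen by the PyTorch allocator during inference."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = YOLO(model_path)
    dummy = torch.rand(1, 3, 640, 640).cuda()
    for _ in range(iterations):
        model.predict(dummy, verbose=False)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)

print(f"FP32 peak memory: {peak_gpu_memory_mb('runs/train/yolov8n_baseline/weights/best.pt'):.1f} MB")
print(f"INT8 peak memory: {peak_gpu_memory_mb('yolov8n_int8.engine'):.1f} MB")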
Expected Performance Gains: Typical quantization results on NVIDIA Jetson platforms show model size reduction of 3.8-4.2x (FP32 to INT8), inference speedup of 1.5-2.75x depending on platform (higher on newer Jetson Orin), accuracy degradation of 0.5-1.8% mAP50-95 with proper calibration, and memory bandwidth reduction of approximately 4x enabling higher concurrent inference loads.
Performance varies significantly by hardware platform. Jetson Nano achieves 1.5-2x speedup, Jetson Xavier NX achieves 2-2.5x speedup, Jetson Orin achieves 2.5-2.75x speedup, and Raspberry Pi with Coral TPU achieves 3-4x speedup for INT8-optimized inference.
Practical Implementation Example: End-to-End Workflow
Complete workflow demonstrating training, quantization, validation, and benchmarking for production deployment:
#!/usr/bin/env python3
"""
Complete YOLOv8 quantization workflow
"""
from ultralytics import YOLO
import torch
import time
class YOLOv8QuantizationPipeline:
def __init__(self, data_yaml, project_name='yolov8_quantization'):
self.data_yaml = data_yaml
self.project_name = project_name
self.baseline_model = None
self.quantized_model = None
def train_baseline(self, epochs=100, imgsz=640, batch=16):
"""Train baseline FP32 model"""
print("Training baseline FP32 model...")
model = YOLO('yolov8n.pt')
results = model.train(
data=self.data_yaml,
epochs=epochs,
imgsz=imgsz,
batch=batch,
device=0,
project=f'runs/{self.project_name}',
name='baseline'
)
self.baseline_model = f'runs/{self.project_name}/baseline/weights/best.pt'
# Validate baseline
metrics = model.val()
print(f"Baseline mAP50-95: {metrics.box.map:.4f}")
return metrics
def apply_quantization(self, method='ptq'):
"""Apply INT8 quantization"""
print(f"Applying {method.upper()} quantization...")
model = YOLO(self.baseline_model)
if method == 'ptq':
# PTQ via TensorRT export
model.export(
format='engine',
imgsz=640,
int8=True,
data=self.data_yaml,
workspace=4
)
self.quantized_model = self.baseline_model.replace('.pt', '.engine')
elif method == 'qat':
# QAT fine-tuning
qat_results = model.train(
data=self.data_yaml,
epochs=15,
imgsz=640,
batch=16,
lr0=0.0001,
freeze=10,
project=f'runs/{self.project_name}',
name='qat'
)
            # Reload the fine-tuned weights, then export with INT8 calibration
            model = YOLO(f'runs/{self.project_name}/qat/weights/best.pt')
            model.export(format='engine', imgsz=640, int8=True, data=self.data_yaml)
            self.quantized_model = f'runs/{self.project_name}/qat/weights/best.engine'
print(f"Quantized model: {self.quantized_model}")
def validate_quantization(self):
"""Validate quantized model accuracy"""
print("Validating quantized model...")
baseline = YOLO(self.baseline_model)
quantized = YOLO(self.quantized_model)
baseline_metrics = baseline.val(data=self.data_yaml)
quantized_metrics = quantized.val(data=self.data_yaml)
degradation = (baseline_metrics.box.map - quantized_metrics.box.map)
degradation_pct = (degradation / baseline_metrics.box.map) * 100
print(f"\nAccuracy Comparison:")
print(f"Baseline mAP50-95: {baseline_metrics.box.map:.4f}")
print(f"Quantized mAP50-95: {quantized_metrics.box.map:.4f}")
print(f"Degradation: {degradation:.4f} ({degradation_pct:.2f}%)")
return degradation_pct < 2.0 # Pass if < 2% degradation
def benchmark_performance(self):
"""Benchmark inference performance"""
print("Benchmarking performance...")
def benchmark(model_path, iterations=100):
model = YOLO(model_path)
dummy = torch.randn(1, 3, 640, 640).cuda()
# Warmup
for _ in range(10):
model.predict(dummy, verbose=False)
# Benchmark
start = time.perf_counter()
for _ in range(iterations):
model.predict(dummy, verbose=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
return (elapsed / iterations) * 1000 # ms per inference
baseline_latency = benchmark(self.baseline_model)
quantized_latency = benchmark(self.quantized_model)
speedup = baseline_latency / quantized_latency
print(f"\nPerformance Comparison:")
print(f"Baseline latency: {baseline_latency:.2f}ms")
print(f"Quantized latency: {quantized_latency:.2f}ms")
print(f"Speedup: {speedup:.2f}x")
return speedup
# Execute complete pipeline
if __name__ == '__main__':
pipeline = YOLOv8QuantizationPipeline('dataset.yaml')
# Train baseline
pipeline.train_baseline(epochs=100)
# Apply PTQ
pipeline.apply_quantization(method='ptq')
# Validate
passes_validation = pipeline.validate_quantization()
if not passes_validation:
print("PTQ validation failed, trying QAT...")
pipeline.apply_quantization(method='qat')
pipeline.validate_quantization()
# Benchmark
speedup = pipeline.benchmark_performance()
print(f"\nPipeline complete! Achieved {speedup:.2f}x speedup")
Key Takeaways
YOLOv8 quantization enables practical edge deployment through significant model compression and inference acceleration. PTQ provides the fastest quantization workflow and suits most scenarios, achieving roughly 4x model size reduction and 1.5-2.75x inference speedup with typical accuracy degradation under 2% mAP. QAT recovers an additional 0.5-1.5% mAP over PTQ when PTQ degradation exceeds acceptable thresholds.
Calibration dataset selection critically impacts quantization quality. Use 500-1000 representative images covering diverse deployment conditions. Export format selection depends on target platform: TensorRT for NVIDIA Jetson, TFLite for mobile/Coral TPU, ONNX for cross-platform flexibility. Rigorous validation ensures quantized models meet production accuracy requirements through comprehensive metric comparison and per-class analysis.
Performance benchmarking quantifies quantization benefits across latency, throughput, model size, and memory usage. Typical gains include 1.5-2.75x speedup on NVIDIA platforms, 4x model size reduction, and proportional memory bandwidth reduction enabling higher concurrent inference loads. Understanding quantization workflows enables confident deployment of optimized models meeting edge device constraints while maintaining acceptable accuracy.
Part 3 continues with practical NVIDIA Jetson deployment, covering JetPack environment setup, TensorRT compilation procedures, INT8 calibration on target hardware, performance tuning strategies, and thermal management techniques for sustained inference rates across Jetson Nano through AGX Orin platforms.
References
- Ultralytics YOLOv8 Training Documentation (https://docs.ultralytics.com/modes/train/)
- Ultralytics YOLOv8 Export Formats (https://docs.ultralytics.com/modes/export/)
- PyTorch Quantization Documentation (https://pytorch.org/docs/stable/quantization.html)
- NVIDIA TensorRT INT8 Calibration Guide (https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/)
- ONNX Runtime Quantization (https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html)
- TensorFlow Lite Post-Training Quantization (https://www.tensorflow.org/lite/performance/post_training_quantization)
- Quantization and Training of Neural Networks (https://arxiv.org/abs/1806.08342)
- Ultralytics Community Discussions (https://github.com/ultralytics/ultralytics/discussions)
