From TT100K to YOLO Format: A Pitfall-Avoidance Guide to Dataset Conversion and Splitting (Full Code Included)


张建站

2026/5/20 21:52:32

10-minute read


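From TT100K to YOLO Format: A Hands-On Guide to Converting a Traffic Sign Detection Dataset

If you are using YOLOv5 or YOLOv8 for traffic sign detection, the TT100K dataset has probably caught your eye. With more than ten thousand images it looks ideal, but once you start working with it you find that the path from the raw data to a YOLO training format hides quite a few pitfalls. This article walks through the full TT100K→COCO→YOLO conversion pipeline and shares lessons learned in my own projects.

1. Understanding the structure and challenges of TT100K

TT100K, short for Tsinghua-Tencent 100K, is a large traffic sign dataset released jointly by Tsinghua University and Tencent. The raw dataset contains more than 10,000 high-resolution images (2048×2048) annotated with 221 traffic sign classes. Using it directly for YOLO training runs into several key problems:

- Class imbalance: of the 221 classes, many have only single-digit sample counts, while common ones such as speed-limit signs have over a thousand.
- Format incompatibility: the raw annotations are JSON, which does not match the txt format YOLO expects.
- Unsuitable split: the original train/test/other split is not well suited to deep learning training.

```
# Example layout of the raw TT100K data
tt100k_dataset/
├── annotations/   # JSON annotation files
├── train/         # training images (6,105)
├── test/          # test images (3,071)
└── other/         # other images (7,641)
```

Tip: before converting anything, back up the raw dataset and do all work on the copy.

2. Data cleaning and class selection

With 221 imbalanced classes, training on everything rarely works well. Clean the data and select classes first:

1. Count the class distribution: parse all JSON files and count how often each class occurs.
2. Filter by threshold: a common choice is to keep classes with at least 100 samples; adjust to your needs.
3. Handle multi-label images: some images contain several traffic signs, so make sure only annotations of the target classes are kept.

The code below assumes one JSON file per image in annotations/. If your copy of the dataset ships a single combined annotation file instead, split it per image first or adapt the loader.

```python
import json
import os
from collections import defaultdict


# Count how often each category occurs across all annotation files
def count_categories(annotation_dir):
    category_counts = defaultdict(int)
    for ann_file in os.listdir(annotation_dir):
        with open(os.path.join(annotation_dir, ann_file)) as f:
            data = json.load(f)
        for obj in data["objects"]:
            category_counts[obj["category"]] += 1
    return category_counts


# Example: keep categories with at least 100 samples
category_counts = count_categories("tt100k/annotations")
selected_categories = [cat for cat, count in category_counts.items() if count >= 100]
print(f"{len(selected_categories)} categories kept after filtering")
```

After filtering, typically 45 to 50 major traffic sign classes remain, and a model trained on them usually performs noticeably better than one trained on the full 221-class label set.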
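Step 3 of the list above (keeping only target-class annotations) has no code in the original walkthrough. The following is a minimal sketch, assuming each parsed annotation is a dict with an `objects` list whose entries carry a `category` key, as in the counting code. The helper names `select_categories` and `filter_objects` and the toy data are hypothetical, introduced only for illustration.

```python
from collections import Counter


def select_categories(annotations, min_count=100):
    """Return the sorted categories that occur at least min_count times."""
    counts = Counter(obj["category"] for ann in annotations for obj in ann["objects"])
    # Sorting makes the later category -> id mapping reproducible between runs
    return sorted(cat for cat, n in counts.items() if n >= min_count)


def filter_objects(annotation, selected):
    """Return a copy of one image's annotation with only the selected categories."""
    keep = set(selected)
    kept = [obj for obj in annotation["objects"] if obj["category"] in keep]
    return {**annotation, "objects": kept}


# Toy data standing in for parsed annotation files (hypothetical values)
toy = [
    {"objects": [{"category": "pl40"}, {"category": "w13"}]},
    {"objects": [{"category": "pl40"}]},
]
selected = select_categories(toy, min_count=2)
filtered = [filter_objects(ann, selected) for ann in toy]
print(selected, [len(ann["objects"]) for ann in filtered])
```

Images left with an empty `objects` list can either be dropped or kept as background samples; which is better depends on the project.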
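3. Converting from TT100K to COCO format

COCO is one of the common interchange formats in computer vision and a good intermediate step on the way to YOLO. The conversion has to handle the following points:

- Coordinate conversion: TT100K stores box corners in absolute pixels (xmin, ymin, xmax, ymax), whereas COCO stores [x, y, width, height], also in absolute pixels. Normalization happens only later, in the YOLO step.
- Category ID mapping: create new, contiguous IDs for the selected classes.
- Image size: although all TT100K images are 2048×2048, the size still has to be declared explicitly in the annotations.

```python
# Core of the TT100K-to-COCO conversion
import json
import os


def tt100k_to_coco(tt100k_dir, output_json, selected_categories):
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i + 1, "name": cat} for i, cat in enumerate(selected_categories)],
    }
    cat_to_id = {cat: i + 1 for i, cat in enumerate(selected_categories)}

    for img_file in os.listdir(os.path.join(tt100k_dir, "train")):
        img_id = len(coco["images"]) + 1
        coco["images"].append({
            "id": img_id,
            "file_name": img_file,
            "width": 2048,
            "height": 2048,
        })

        ann_file = os.path.join(tt100k_dir, "annotations", img_file.replace(".jpg", ".json"))
        with open(ann_file) as f:
            data = json.load(f)

        for obj in data["objects"]:
            if obj["category"] in cat_to_id:
                # Corners (xmin, ymin, xmax, ymax) -> COCO [x, y, w, h], still in pixels
                x, y = obj["bbox"]["xmin"], obj["bbox"]["ymin"]
                w, h = obj["bbox"]["xmax"] - x, obj["bbox"]["ymax"] - y
                coco["annotations"].append({
                    "id": len(coco["annotations"]) + 1,
                    "image_id": img_id,
                    "category_id": cat_to_id[obj["category"]],
                    "bbox": [x, y, w, h],
                    "area": w * h,
                    "iscrowd": 0,
                })

    with open(output_json, "w") as f:
        json.dump(coco, f)
```

Note: take particular care with the corner-to-[x, y, w, h] conversion here. The normalization in the next step builds on it, and both have to be right for YOLO training to succeed.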
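To make the coordinate handling concrete, here is a small worked example of the two box conversions in this pipeline: corners to COCO [x, y, w, h] in pixels, then COCO to normalized YOLO values. It is only a sketch; the function names and the sample box are made up for illustration.

```python
def corners_to_coco(xmin, ymin, xmax, ymax):
    """Corner format -> COCO [x, y, width, height], all in absolute pixels."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def coco_to_yolo_box(bbox, img_w, img_h):
    """COCO pixel box -> YOLO (center_x, center_y, width, height), normalized to 0-1."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


# A hypothetical 256x128 sign on a 2048x2048 image
coco_box = corners_to_coco(512, 1024, 768, 1152)
yolo_box = coco_to_yolo_box(coco_box, 2048, 2048)
print(coco_box)  # [512, 1024, 256, 128]
print(yolo_box)  # (0.3125, 0.53125, 0.125, 0.0625)
```

Keeping the two steps as separate functions makes it easy to unit-test each one, which helps when a model trains badly and the labels are the first suspect.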
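4. The final conversion from COCO to YOLO

With the data in COCO format, the next step is the txt format YOLO trains on. Its characteristics are:

- Each image has a txt file with the same base name.
- Each line describes one object in the form class_id center_x center_y width height.
- All coordinates are normalized to 0-1 relative to the image width and height.

```python
import json
import os
from collections import defaultdict


def coco_to_yolo(coco_json, output_dir, img_dir):
    with open(coco_json) as f:
        coco = json.load(f)
    os.makedirs(output_dir, exist_ok=True)

    # Map COCO category IDs to zero-based YOLO class IDs
    cat_id_map = {cat["id"]: i for i, cat in enumerate(coco["categories"])}

    # Group annotations by image
    img_anns = defaultdict(list)
    for ann in coco["annotations"]:
        img_anns[ann["image_id"]].append(ann)

    # Write one label file per image
    for img in coco["images"]:
        img_id = img["id"]
        img_w, img_h = img["width"], img["height"]
        txt_file = os.path.join(output_dir, os.path.splitext(img["file_name"])[0] + ".txt")

        with open(txt_file, "w") as f:
            for ann in img_anns.get(img_id, []):
                # Top-left pixel box -> normalized center box
                x, y, w, h = ann["bbox"]
                center_x = (x + w / 2) / img_w
                center_y = (y + h / 2) / img_h
                norm_w = w / img_w
                norm_h = h / img_h

                class_id = cat_id_map[ann["category_id"]]
                f.write(f"{class_id} {center_x:.6f} {center_y:.6f} {norm_w:.6f} {norm_h:.6f}\n")
```

5. Best practices for splitting the dataset and organizing files

Once the format conversion is done, the data has to be split sensibly into training, validation and test sets. Rather than reuse the original TT100K split, I recommend the following strategy:

- Merge all raw data: pool the images from train/test/other.
- Stratify by class: make sure each subset has roughly the same class proportions as the whole.
- Typical ratio: 70% training, 15% validation, 15% test.

The snippet below uses a plain random split for brevity and, like the earlier examples, reads only tt100k/train. Stratifying images that contain several classes takes extra work and is not shown.

```python
import os

from sklearn.model_selection import train_test_split


def split_dataset(image_files, test_size=0.3, random_state=42):
    """Split into train/val/test; test_size is the fraction held out for val + test."""
    # First cut: 70% train, 30% held out
    train, holdout = train_test_split(image_files, test_size=test_size, random_state=random_state)
    # Second cut: halve the held-out part into 15% val and 15% test
    val, test = train_test_split(holdout, test_size=0.5, random_state=random_state)
    return train, val, test


# Example usage
all_images = [f for f in os.listdir("tt100k/train") if f.endswith(".jpg")]
train_files, val_files, test_files = split_dataset(all_images)
```

```
# Target directory layout
dataset_root/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
└── labels/
    ├── train/
    ├── val/
    └── test/
```

For a large dataset, copying files directly can be slow. Symbolic links are a good way to organize the data instead:

```bash
# Example: create symbolic links for the training set
ln -s /path/to/original/images/train/ /path/to/dataset/images/train
ln -s /path/to/converted/labels/train/ /path/to/dataset/labels/train
```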
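Because a fresh 70/15/15 split no longer lines up with the original directories, linking file by file according to the split lists may be more practical than linking whole directories. The sketch below assumes a POSIX-like system where `os.symlink` works without extra privileges; `link_split` and the temporary directory layout are hypothetical and exist only to demonstrate the idea.

```python
import os
import tempfile


def link_split(files, src_dir, dst_dir):
    """Symlink each named file from src_dir into dst_dir; return how many links were created."""
    os.makedirs(dst_dir, exist_ok=True)
    made = 0
    for name in files:
        dst = os.path.join(dst_dir, name)
        if os.path.lexists(dst):
            continue  # already linked, so the function is safe to re-run
        os.symlink(os.path.abspath(os.path.join(src_dir, name)), dst)
        made += 1
    return made


# Demonstration in a throwaway directory with empty placeholder "images"
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "all_images")
    os.makedirs(src)
    for name in ("1.jpg", "2.jpg", "3.jpg"):
        open(os.path.join(src, name), "w").close()

    train_dir = os.path.join(root, "dataset", "images", "train")
    made_first = link_split(["1.jpg", "2.jpg"], src, train_dir)
    made_again = link_split(["1.jpg", "2.jpg"], src, train_dir)
    linked = sorted(os.listdir(train_dir))
    all_links = all(os.path.islink(os.path.join(train_dir, n)) for n in linked)

print(made_first, made_again, linked, all_links)
```

The same helper can be called once per subset for the label files, so that images/ and labels/ always stay in sync.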
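6. Verifying the conversion

Before you start training, always verify the conversion results. The key checkpoints are:

- Annotation alignment: sample images at random and check that the boxes actually cover the traffic signs.
- Class distribution: make sure the training, validation and test sets have similar class distributions.
- Format compliance: confirm that the YOLO label files are fully well-formed.

```python
import os
import random

import cv2


def visualize_annotation(image_path, label_path, class_names):
    """Draw the YOLO boxes on the image for a visual check."""
    image = cv2.imread(image_path)
    h, w = image.shape[:2]

    with open(label_path) as f:
        for line in f:
            class_id, cx, cy, nw, nh = map(float, line.strip().split())
            # Convert normalized center boxes back to pixel corners
            x1 = int((cx - nw / 2) * w)
            y1 = int((cy - nh / 2) * h)
            x2 = int((cx + nw / 2) * w)
            y2 = int((cy + nh / 2) * h)

            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(image, class_names[int(class_id)], (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.9, (36, 255, 12), 2)

    cv2.imshow("Annotation Check", image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()


# Example usage
class_names = ["prohibitory", "mandatory", "warning"]  # your class list
sample_image = random.choice(os.listdir("dataset/images/train"))
image_path = f"dataset/images/train/{sample_image}"
label_path = f"dataset/labels/train/{os.path.splitext(sample_image)[0]}.txt"
visualize_annotation(image_path, label_path, class_names)
```

7. Tips for processing large datasets efficiently

Converting tens of thousands of high-resolution images can take a long time. A few techniques help:

- Parallel processing: use Python's multiprocessing module to process images in parallel.
- Incremental processing: work in batches to avoid running out of memory.
- Caching intermediate results: save intermediate outputs such as the COCO file to make debugging easier.

```python
from multiprocessing import Pool


def process_image(args):
    """Wrap the per-image conversion so it can run in a worker process."""
    img_file, input_dir, output_dir = args
    # Put the actual conversion logic here
    pass


# Parallel processing example
def batch_process(image_files, input_dir, output_dir, workers=4):
    with Pool(workers) as p:
        args = [(f, input_dir, output_dir) for f in image_files]
        p.map(process_image, args)


# Process a large dataset in batches (all_images is the file list built in section 5)
if __name__ == "__main__":
    batch_size = 1000
    for i in range(0, len(all_images), batch_size):
        batch = all_images[i:i + batch_size]
        batch_process(batch, "tt100k/train", "dataset/images/train")
```

In real projects I have found it most convenient to wrap the whole conversion in a configurable Pipeline class: the parameters of each step can be adjusted freely, and intermediate state is saved so problems are easier to track down.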
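As a last check before training, the format-compliance item from section 6 can also be automated. The following is a minimal sketch of a per-line validator; `validate_yolo_line` is a hypothetical helper, and the rules it enforces (five fields, an integer class ID in range, coordinates within 0-1, positive box size) are the format properties described in section 4 rather than an official specification.

```python
def validate_yolo_line(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty if the line is valid)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]

    try:
        class_id = int(parts[0])
    except ValueError:
        return [f"class id is not an integer: {parts[0]!r}"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} out of range")

    try:
        cx, cy, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return problems + ["non-numeric coordinate"]

    for name, value in (("center_x", cx), ("center_y", cy), ("width", w), ("height", h)):
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} outside [0, 1]")
    if w <= 0 or h <= 0:
        problems.append("box has zero or negative size")
    return problems


print(validate_yolo_line("3 0.312500 0.531250 0.125000 0.062500", num_classes=45))
print(validate_yolo_line("45 0.5 0.5 0.1 0.1", num_classes=45))
```

Running this over every line of every label file and logging the file name alongside each problem gives a quick report of malformed labels.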
