innit: Fast English vs. Non-English Text Detection

A lightweight byte-level CNN for fast binary language detection (English vs. non-English).

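Model Details

Model type: byte-level convolutional neural network
Task: binary text classification (English vs. non-English)
Architecture: TinyByteCNN_EN with 6 convolutional blocks
Parameters: 156,642
Input: raw UTF-8 bytes (at most 256 bytes)
Output: binary classification (0 = non-English, 1 = English)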
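
Because the input limit is counted in bytes rather than characters, non-Latin scripts use up the 256-byte budget faster than English does. The following is a minimal sketch of that relationship; `byte_budget` and `MAX_BYTES` are illustrative names, not part of the released package.

```python
MAX_BYTES = 256  # input limit stated in the model details above

def byte_budget(text):
    """Report how much of the 256-byte input window a string would occupy."""
    raw = text.encode("utf-8")
    return {
        "characters": len(text),
        "bytes": len(raw),
        "truncated": len(raw) > MAX_BYTES,  # True if bytes beyond the limit would be cut
    }

print(byte_budget("Hello"))    # ASCII: 1 byte per character
print(byte_budget("Привет"))   # Cyrillic: 2 bytes per character
print(byte_budget("你好"))      # CJK: 3 bytes per character
```

A 100-character Chinese sentence therefore already exceeds the window, while a 100-character English sentence uses well under half of it.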

Performance

Validation accuracy: 99.94%
Challenge set accuracy: 100% (14/14 test cases)
Inference speed: sub-millisecond on modern CPUs
Model size: ~600KB
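
The latency figure depends on hardware and runtime, so it is worth measuring locally. Below is a minimal timing sketch; `mean_latency_ms` is a hypothetical helper, and `predict_fn` stands for any callable that takes a string, such as the `predict` function from the Quick Start section.

```python
import time

def mean_latency_ms(predict_fn, text, runs=1000, warmup=50):
    """Return the mean wall-clock time per call in milliseconds."""
    # Warm-up calls keep one-time setup costs out of the measurement.
    for _ in range(warmup):
        predict_fn(text)
    start = time.perf_counter()
    for _ in range(runs):
        predict_fn(text)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs

# Usage, assuming a predict(text) function is already defined:
# print(f"{mean_latency_ms(predict, 'Hello world!'):.3f} ms per call")
```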

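Supported Languages

Trained to distinguish English from 52+ languages across a range of scripts:

Latin script: Spanish, French, German, Italian, Dutch, Portuguese, and others
CJK scripts: Chinese (Simplified/Traditional), Japanese, Korean
Cyrillic script: Russian, Ukrainian, Bulgarian, Serbian
Other scripts: Arabic, Hindi, Bengali, Thai, Hebrew, and others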

Architecture

TinyByteCNN_EN:
├── Embedding: 257 → 80 dimensions (256 bytes + padding)
├── 6x Convolutional Blocks:
│   ├── Conv1D (kernel=3, residual connections)
│   ├── GELU activation
│   ├── BatchNorm1D  
│   └── Dropout (0.15)
├── Enhanced Pooling: mean + max + std
└── Classification Head: 240 → 80 → 2
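
The layer sizes in the diagram can be checked against the stated parameter count. The arithmetic below assumes standard layer definitions that the model card does not spell out: convolutions with bias and 80 input and output channels, BatchNorm with a learnable scale and shift per channel, and biased linear layers in the head. Under those assumptions the total matches the published 156,642.

```python
VOCAB, EMB, BLOCKS, KERNEL, CLASSES = 257, 80, 6, 3, 2

embedding = VOCAB * EMB                    # 257 * 80 = 20,560
conv = EMB * EMB * KERNEL + EMB            # weights + bias = 19,280 (assumed: bias enabled)
batchnorm = 2 * EMB                        # scale + shift = 160 (assumed: affine BatchNorm)
blocks = BLOCKS * (conv + batchnorm)       # 6 * 19,440 = 116,640

pooled = 3 * EMB                           # mean + max + std over 80 channels = 240
head = (pooled * EMB + EMB) + (EMB * CLASSES + CLASSES)  # 19,280 + 162 = 19,442

total = embedding + blocks + head
print(total)  # 156642
```

The pooling stage itself contributes no parameters; it only widens the feature vector from 80 to 240 before the classification head.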

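Training Data

Total samples: 17,543, balanced between classes
English: 8,772 samples from diverse sources
Non-English: 8,771 samples covering 52+ languages
Text length: 3-276 characters (optimized for short text)
Special coverage: emoji handling, mathematical formulas, scientific notation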

Quick Start

Option 1: ONNX Runtime (recommended)

import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("model.onnx")

def predict(text):
    # Prepare input
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)
    padded[:len(bytes_data)] = list(bytes_data)
    
    # Run inference
    outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
    logits = outputs[0][0]
    
    # Apply softmax
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return probs[1]  # English probability

# Examples
print(predict("Hello world!"))           # ~1.0 (English)
print(predict("Bonjour le monde"))       # ~0.0 (French)
print(predict("Check our sale! 🎉"))     # ~1.0 (English with emoji)
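
The `predict` function above handles one string per call. When many strings need scoring, the byte preparation can be done for all of them at once. The sketch below only builds the input array; `encode_batch` is an illustrative helper, and whether the exported ONNX graph accepts a batch dimension larger than 1 is not stated in this model card, so that part is an assumption to verify before relying on it.

```python
import numpy as np

MAX_BYTES = 256

def encode_batch(texts):
    """Encode strings into a zero-padded (len(texts), 256) int64 array of byte values."""
    batch = np.zeros((len(texts), MAX_BYTES), dtype=np.int64)
    for row, text in enumerate(texts):
        data = text.encode("utf-8", errors="ignore")[:MAX_BYTES]
        # Copy the raw byte values into the row; the remainder stays zero-padded.
        batch[row, :len(data)] = np.frombuffer(data, dtype=np.uint8)
    return batch

batch = encode_batch(["Hello world!", "Bonjour le monde", "你好"])
print(batch.shape)  # (3, 256)

# Assumption: if the model supports dynamic batch sizes, this array could be passed as
# session.run(['logits'], {'input_bytes': batch}); otherwise feed batch[i:i+1] row by row.
```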

Option 2: Python package

# Install the utility package
pip install innit-detector

# CLI usage
innit "Hello world!"                    # → English (confidence: 0.974)
innit --download                        # Download model first
innit "Hello" "Bonjour" "你好"          # Multiple texts

# Library usage
from innit_detector import InnitDetector
detector = InnitDetector()
result = detector.predict("Hello world!")
print(result['is_english'])  # True

Option 3: PyTorch (advanced)

import torch
import torch.nn.functional as F
from safetensors.torch import load_file
import numpy as np

# Load model (requires TinyByteCNN_EN class definition)
state_dict = load_file("model.safetensors")
model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
model.load_state_dict(state_dict)
model.eval()

def predict(text):
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)  # np.long is unavailable in NumPy 1.24-1.26; int64 matches Option 1
    padded[:len(bytes_data)] = list(bytes_data)
    
    with torch.no_grad():
        logits = model(torch.tensor(padded).unsqueeze(0))
        probs = F.softmax(logits, dim=1)
        return probs[0][1].item()
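
The snippet above relies on a `TinyByteCNN_EN` class that this model card does not include. As a rough guide to what such a class might look like, here is a hypothetical reconstruction from the architecture diagram. The class name `TinyByteCNNSketch`, the attribute names, the padding choice, and the activation in the head are all assumptions, so the released `model.safetensors` will load into it only if the author's parameter names and shapes happen to match. Under these assumptions the sketch has 156,642 parameters, the same count the model card reports.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv1D -> GELU -> BatchNorm1D -> Dropout with a residual connection (assumed order)."""
    def __init__(self, channels, dropout):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.norm = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(self.norm(self.act(self.conv(x))))

class TinyByteCNNSketch(nn.Module):
    """Hypothetical stand-in for TinyByteCNN_EN; not the author's definition."""
    def __init__(self, emb=80, blocks=6, dropout=0.15):
        super().__init__()
        self.embed = nn.Embedding(257, emb)  # 256 byte values + 1 padding slot
        self.blocks = nn.Sequential(*[ConvBlock(emb, dropout) for _ in range(blocks)])
        # Head: 240 -> 80 -> 2; the GELU between the layers is an assumption.
        self.head = nn.Sequential(nn.Linear(3 * emb, emb), nn.GELU(), nn.Linear(emb, 2))

    def forward(self, byte_ids):
        x = self.embed(byte_ids).transpose(1, 2)  # (batch, length, emb) -> (batch, emb, length)
        x = self.blocks(x)
        # Enhanced pooling: concatenate mean, max, and std over the length axis.
        pooled = torch.cat([x.mean(dim=2), x.amax(dim=2), x.std(dim=2)], dim=1)
        return self.head(pooled)

model = TinyByteCNNSketch().eval()
with torch.no_grad():
    logits = model(torch.zeros(1, 256, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 2])
```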

ONNX Support

An ONNX version is provided for cross-platform deployment:

model.onnx - full precision (FP32), maximum compatibility

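Challenge Set Results

Achieves 100% accuracy on the challenge set (14/14 test cases):

Very short text: "Good morning!" ✅
Emoji handling: "Check out our sale! 🎉" ✅
Mathematical formulas: "x = (-b ± √(b²-4ac))/2a" ✅
Scientific notation: "CO₂ + H₂O → C₆H₁₂O₆" ✅
Diverse scripts: Arabic, CJK, Cyrillic, Devanagari ✅
English-like languages: Dutch, German ✅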
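
A check of this kind is easy to reproduce on your own inputs. The sketch below is a minimal harness, not the author's evaluation script: `evaluate` is a hypothetical helper, the 0.5 decision threshold is an assumption, and only two of the fourteen challenge cases are listed because the model card does not publish the full set with labels.

```python
def evaluate(predict_fn, cases, threshold=0.5):
    """Score (text, expected_is_english) pairs; return (number correct, failed texts)."""
    failures = []
    for text, expected_english in cases:
        predicted_english = predict_fn(text) >= threshold  # assumed decision rule
        if predicted_english != expected_english:
            failures.append(text)
    return len(cases) - len(failures), failures

# Two English cases taken from the list above; extend with your own labeled texts.
cases = [
    ("Good morning!", True),
    ("Check out our sale! 🎉", True),
]

# Usage, assuming predict(text) returns the English probability:
# correct, failures = evaluate(predict, cases)
# print(f"{correct}/{len(cases)} correct", failures)
```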

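Limitations

Binary classification only (English vs. non-English)
Optimized for text of up to 256 UTF-8 bytes; the example code truncates longer input
Accuracy may drop for languages that are rare in the training data
Not suitable for mixed-language text (several languages in a single input)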
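
One way to work within the 256-byte limit is to split longer text into windows that each fit the model input, score each window separately, and combine the scores. The sketch below covers only the splitting step and keeps multi-byte characters intact; `chunk_utf8` is an illustrative helper, and the model card does not describe or evaluate this approach, so how well per-window scores hold up on long or mixed-language input is untested here.

```python
def chunk_utf8(text, max_bytes=256):
    """Split text into pieces of at most max_bytes UTF-8 bytes without cutting characters."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        # Start a new chunk when adding this character would exceed the byte budget.
        if size + n > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks

pieces = chunk_utf8("a" * 300)
print([len(p) for p in pieces])  # [256, 44]

# Usage, assuming predict(text) returns the English probability:
# scores = [predict(piece) for piece in chunk_utf8(long_text)]
# mean_score = sum(scores) / len(scores)
```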

License

MIT License - free for commercial use.

Mitchins/innit-language-detection

Author: Mitchins

text-classification pytorch

Created: 2025-08-27 03:19:35+00:00

Updated: 2025-08-27 03:33:52+00:00

Files (6)

.gitattributes

README.md

config.json

model.onnx ONNX

model.safetensors

model_info.json