SentiStream Encoder V4

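A dedicated dimensionality-reduction engine for real-time sentiment analysis in streaming environments. It compresses high-dimensional BGE-small-en-v1.5 embeddings (384 dimensions) into an optimized, structured 5-dimensional manifold suited to incremental clustering with DBSTREAM.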

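🧠 Architecture and Methodology

This project tackles the "curse of dimensionality" in online clustering with a two-stage training pipeline:

Supervised warm-up: the encoder is pre-trained to classify sentiment polarity (binary classification), so that the resulting latent space is separated by sentiment semantics.
Parametric UMAP (PUMAP): a neural-network implementation of UMAP that learns a global transformation function, so new, unseen data can be projected into the same manifold without retraining.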
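
The second stage can be made more concrete with a small sketch of the objective that UMAP-style training minimizes: a cross-entropy that pulls graph neighbours together and pushes randomly sampled pairs apart in the latent space. This is a NumPy illustration, not the project's training code; the function names and the curve parameters `a=1.0`, `b=1.0` are assumptions (UMAP normally fits `a` and `b` from `min_dist`).

```python
import numpy as np

def low_dim_similarity(z_a, z_b, a=1.0, b=1.0):
    """UMAP's low-dimensional similarity q = 1 / (1 + a * d^(2b))."""
    d2 = np.sum((z_a - z_b) ** 2, axis=-1)  # squared Euclidean distance
    return 1.0 / (1.0 + a * d2 ** b)

def umap_edge_loss(z_i, z_j, z_neg, a=1.0, b=1.0, eps=1e-7):
    """Cross-entropy over positive edges (z_i, z_j) and negative samples (z_i, z_neg).

    All inputs are latent vectors of shape (n, 5); a and b are assumed values.
    """
    q_pos = np.clip(low_dim_similarity(z_i, z_j, a, b), eps, 1.0 - eps)
    q_neg = np.clip(low_dim_similarity(z_i, z_neg, a, b), eps, 1.0 - eps)
    attract = -np.log(q_pos)       # small when graph neighbours are close
    repel = -np.log(1.0 - q_neg)   # small when random pairs are far apart
    return float(np.mean(attract) + np.mean(repel))
```

In a parametric setup the latent vectors come from the encoder network, so minimizing this loss trains the network's weights rather than a fixed table of coordinates. That is what allows unseen embeddings to be projected later.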

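📊 Performance Metrics

Input dimensions: 384 (compatible with BAAI/bge-small-en-v1.5)
Latent dimensions: 5
Pre-training accuracy: 99.92% (Amazon review polarity subset)
Inference latency: < 1 ms per batch (ONNX Runtime on CPU)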
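
The latency figure does not state a batch size or CPU model, so it is worth re-measuring on the target machine. The helper below is a minimal timing sketch using only the standard library; `time_batches` and its parameters are illustrative names, not part of this repository.

```python
import statistics
import time

def time_batches(run_batch, batch, warmup=10, repeats=100):
    """Return latency statistics in milliseconds for calling run_batch(batch)."""
    for _ in range(warmup):          # let caches and thread pools settle
        run_batch(batch)
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch(batch)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
        "max_ms": samples_ms[-1],
    }
```

To time the encoder, pass the `encode` function from the Quick Start section as `run_batch` and a float array of shape `(batch_size, 384)` as `batch`.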

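📂 Repository Structure

model.onnx: production-grade weights for high-speed inference.
pytorch_model.bin: the original PyTorch state dict.
scaler_config.json: StandardScaler mean and scale parameters (required at inference time).
config.json: model hyperparameters and architecture metadata.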
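
For orientation, here is a sketch of how a file like scaler_config.json could be produced from training embeddings. The `mean` and `scale` keys match what the inference example reads. The function names are hypothetical, and whether the real file carries additional fields is not known from this card.

```python
import json
import numpy as np

def fit_scaler_config(train_embeddings):
    """Compute per-feature mean and scale as StandardScaler does by default."""
    x = np.asarray(train_embeddings, dtype=np.float64)
    mean = x.mean(axis=0)
    scale = x.std(axis=0)        # population standard deviation (ddof=0)
    scale[scale == 0.0] = 1.0    # constant features are left unscaled
    return {"mean": mean.tolist(), "scale": scale.tolist()}

def save_scaler_config(config, path="scaler_config.json"):
    """Write the parameters as plain JSON lists so NumPy can reload them."""
    with open(path, "w") as f:
        json.dump(config, f)
```

The same statistics must be used at inference time as at training time, which is why the file ships alongside the model weights.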

🚀 Quick Start

1. Requirements

pip install onnxruntime numpy

2. Inference Example (ONNX)

import onnxruntime as ort
import numpy as np
import json

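# Load the scaler parameters (per-feature mean and scale, 384 values each)
with open("scaler_config.json", "r") as f:
    config = json.load(f)
    mean, scale = np.array(config['mean']), np.array(config['scale'])

# Load the ONNX session
session = ort.InferenceSession("model.onnx")

def encode(embeddings):
    # embeddings: NumPy array of shape (batch, 384)
    # Standardize the input with the training-time statistics
    x = (embeddings - mean) / scale
    # Run the model; ONNX Runtime expects float32 here
    inputs = {session.get_inputs()[0].name: x.astype(np.float32)}
    return session.run(None, inputs)[0]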
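
In a streaming setting, embeddings usually arrive one at a time rather than as ready-made batches. The sketch below shows one way to group them into micro-batches before calling the `encode` function above. `encode_stream`, `encode_fn`, and `batch_size` are hypothetical names, not part of the repository.

```python
from itertools import islice
import numpy as np

def encode_stream(embedding_iter, encode_fn, batch_size=64):
    """Yield one latent vector per incoming embedding, encoding in micro-batches."""
    it = iter(embedding_iter)
    while True:
        chunk = list(islice(it, batch_size))   # up to batch_size items
        if not chunk:
            break                              # stream exhausted
        batch = np.asarray(chunk, dtype=np.float32)
        if batch.ndim != 2 or batch.shape[1] != 384:
            raise ValueError(f"expected (n, 384) embeddings, got {batch.shape}")
        for latent in encode_fn(batch):        # one row per input embedding
            yield latent
```

Each yielded vector can then be handed to an incremental clusterer. A larger `batch_size` amortizes per-call overhead but delays the first result.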

🛠 Training Details

Optimizer: Adam (lr=3.57e-3)
Loss function: hybrid cross-entropy + UMAP loss
Manifold hyperparameters: n_neighbors=26, min_dist=0.00366
Environment: Arch Linux, Python 3.12, CUDA 12.x
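
The card names the loss as a hybrid without giving its exact form. One plausible reading, assuming a simple weighted sum, is written out below. The weight $\lambda$ is a hypothetical symbol that does not appear in the source, $y_n$ and $\hat{y}_n$ are the polarity label and predicted probability, and $z_i$ are the 5-dimensional latent vectors.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{UMAP}}

\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{n=1}^{N}
  \bigl[\, y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n) \,\bigr]

\mathcal{L}_{\mathrm{UMAP}} = -\sum_{(i,j)}
  \bigl[\, p_{ij} \log q_{ij} + (1 - p_{ij})\log(1 - q_{ij}) \,\bigr],
\qquad
q_{ij} = \bigl(1 + a\,\lVert z_i - z_j \rVert^{2b}\bigr)^{-1}
```

In standard UMAP the edge weights $p_{ij}$ come from a nearest-neighbour graph built with `n_neighbors`, and the constants $a$ and $b$ are fitted from `min_dist`. That is where the two manifold hyperparameters listed above would enter.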

⚖ License

MIT

ZenbiteXYZ/sentistream-encoder-v4

Author: ZenbiteXYZ

Created: 2026-03-16 16:43:45+00:00

Updated: 2026-03-16 18:39:16+00:00

Files (7)

.gitattributes

README.md

config.json

model.onnx

pytorch_model.pth

scaler_config.json

sentistream_encoder_v4.onnx.data