Operator fatigue is the failure mode
A security operator watching 16 camera feeds cannot physically pay attention to all of them. The moment that matters is usually the one nobody was watching.
01 The problem we're solving
Most CCTV deployments are reactive. Hours of footage are reviewed only after an incident, cameras do not talk to each other, and the few smart systems that exist flood operators with false alarms or hide behind a single opaque score.
Single detections are noisy in CCTV footage. A useful system must suppress weak signals, bind evidence over time, and show why an alert fired.
A single frame cannot reliably separate harmless contact from hostile motion. HARIS adds the temporal dimension through skeleton windows and tracked identities.
02 In action
Real CCTV footage, annotated live by HARIS.
03 The pipeline
The published CORE pipeline is a specialist chain. Each tier solves one sub-problem and passes structured evidence to the next, so alerts can be audited instead of treated as black-box scores.
Motion: suppresses still frames and opens downstream inference only when motion evidence exists.
Detect: runs one fine-tuned weapon detector for gun and knife, plus one COCO-pretrained detector for people.
Track and pose: maintains person IDs and extracts 17-keypoint COCO skeletons for tracked people.
Action and reason: classifies 30-frame skeleton windows and emits a JSON evidence chain for operator review.
| Stage | Component | What it contributes |
|---|---|---|
| Motion | Frame-difference gate | Reduces wasted inference on still footage and makes the rest of the pipeline event-aware. |
| Detect | Dual RT-DETR | Fine-tuned HARIS weapon detector for gun and knife, paired with COCO-pretrained person detection. |
| Track and pose | BoT-SORT plus RTMPose-L | Stable person IDs across frames and 17 COCO keypoints per tracked person for skeleton reasoning. |
| Action | ST-GCN | Skeleton-based action classification over 30-frame temporal windows instead of one-frame guesses. |
| Reason | Aggressor Logic Engine | Combines detections, tracks, pose, temporal windows, and holder binding into a JSON evidence chain. |
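As a rough illustration of the frame-difference gate in the Motion row, the sketch below opens the gate only when enough pixels changed between two consecutive grayscale frames. The function name, both thresholds, and the raw-bytes frame format are assumptions made for this example, not values from the paper. A real deployment would more likely use an array library than a pure-Python loop.

```python
def motion_gate(prev_gray: bytes, curr_gray: bytes,
                pixel_delta: int = 25,
                min_changed_fraction: float = 0.01) -> bool:
    """Return True when enough pixels changed to justify downstream inference.

    Frames are flat 8-bit grayscale buffers of equal size. Both thresholds
    are illustrative assumptions, not HARIS settings.
    """
    if len(prev_gray) != len(curr_gray) or not curr_gray:
        raise ValueError("frames must be non-empty and the same size")
    # Count pixels whose intensity moved by more than pixel_delta.
    changed = sum(1 for a, b in zip(prev_gray, curr_gray)
                  if abs(a - b) > pixel_delta)
    return changed / len(curr_gray) >= min_changed_fraction


width, height = 160, 120
still = bytes(width * height)            # an all-black frame
moving = bytearray(still)
moving[0:1600] = bytes([200]) * 1600     # a bright region appears

print(motion_gate(still, still))          # False: nothing changed
print(motion_gate(still, bytes(moving)))  # True: 1600 of 19200 pixels changed
```

Gating on a changed-pixel fraction rather than a raw pixel count keeps the threshold independent of camera resolution.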
04 Measured results
These figures follow the published paper. The runtime latency benchmark was measured on an RTX 3070. Training was performed on an RTX 5070 Ti, and the two hardware contexts are kept separate.
per-frame weapon FP
Per-frame weapon false positives were reduced from 8.49% to 0.75%. The paper reports a 74% relative reduction.
video-level FP reduction
The full pipeline flagged 2 of 25 benign videos, compared with 6 of 25 for the raw detector.
end-to-end F1
Evaluated on the in-scope UCF-Crime classes (Shooting, Assault, and Fighting): precision 0.812, recall 0.688.
weapon mAP@50
Aggregate validation mAP@50, with gun 0.768, knife 0.711, precision 0.836, recall 0.690.
curated images
Deduplicated images across 7 source datasets, with leak-audited GroupKFold splits.
mean runtime latency
Measured on RTX 3070: 154.5 ms P95 and 10.1 FPS end-to-end throughput.
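Two of the figures above can be checked with simple arithmetic. The stated precision and recall imply an F1 of about 0.745 through the harmonic mean, and the video-level counts correspond to benign flag rates of 24% and 8%. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def flag_rate(flagged: int, total: int) -> float:
    """Fraction of benign videos that raised an alert."""
    return flagged / total


f1 = f1_score(0.812, 0.688)
raw_rate = flag_rate(6, 25)       # raw detector
pipeline_rate = flag_rate(2, 25)  # full pipeline

print(f"F1 from stated precision and recall: {f1:.3f}")   # 0.745
print(f"Benign videos flagged: raw {raw_rate:.0%}, "
      f"pipeline {pipeline_rate:.0%}")                    # raw 24%, pipeline 8%
```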
05 Operator-facing features
The dashboard is designed as a professional DVR and NVR replacement, not a research notebook. Every overlay is toggleable, every threshold is live-tunable, and every alert shows its reasoning.
Skeleton and mannequin rendering for every tracked person. When pose estimation drops a frame, a last-valid-pose snapshot holds briefly, then falls back to a generic body glow.
Operators tune confidence sensitivity in real time, with live impact on detection panels, overlay strokes, the auto-flagger, and the threat heatmap timeline.
The scrub bar renders detected threat density across the clip, so operators can scan a long video quickly and jump to the seconds that matter.
Detected weapons are bound to the wrist of the nearest tracked person through pose-based proximity, making crowded-scene ownership easier to audit.
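One plausible way to express wrist-based holder binding is to compare the weapon box center against both wrists of every tracked skeleton and keep the closest one within a cutoff. The sketch below assumes that reading. The function name, the 80-pixel cutoff, and the input shapes are illustrative assumptions, and the binding rule HARIS actually uses may differ. Wrist indices 9 and 10 follow the standard 17-keypoint COCO ordering.

```python
import math

LEFT_WRIST, RIGHT_WRIST = 9, 10  # indices in the 17-keypoint COCO layout


def bind_weapon_to_holder(weapon_box, skeletons, max_distance=80.0):
    """Return (track_id, distance) for the nearest wrist, or None if too far.

    weapon_box is (x1, y1, x2, y2) in pixels; skeletons maps a track ID to
    17 (x, y) keypoints. max_distance is an assumed cutoff, not a HARIS value.
    """
    cx = (weapon_box[0] + weapon_box[2]) / 2
    cy = (weapon_box[1] + weapon_box[3]) / 2
    best = None
    for track_id, keypoints in skeletons.items():
        for idx in (LEFT_WRIST, RIGHT_WRIST):
            wx, wy = keypoints[idx]
            dist = math.hypot(wx - cx, wy - cy)
            if best is None or dist < best[1]:
                best = (track_id, dist)
    if best is None or best[1] > max_distance:
        return None
    return best


def skeleton_with_wrists(left, right):
    """Build a toy skeleton where only the wrists carry real positions."""
    points = [(0.0, 0.0)] * 17
    points[LEFT_WRIST], points[RIGHT_WRIST] = left, right
    return points


skeletons = {
    7: skeleton_with_wrists((100.0, 200.0), (140.0, 205.0)),
    12: skeleton_with_wrists((400.0, 210.0), (430.0, 215.0)),
}
holder = bind_weapon_to_holder((130.0, 190.0, 160.0, 220.0), skeletons)
print(holder)  # (7, 5.0): the box center is 5 px from track 7's right wrist
```

Returning the distance alongside the track ID keeps the binding auditable, since an operator can see how close the match was.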
Per-clip brightness and contrast boosts for low-light footage, plus customizable tint for washed-out daytime clips. Both persist per operator.
Every alert carries its evidence: frames, people, weapon class, confidence scores, and the temporal window. Operators can acknowledge, mark false-positive, or escalate.
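To make the shape of such an alert concrete, here is a hypothetical evidence record and review action. Every field name, the status vocabulary, and the sample values are placeholders invented for this sketch, not the HARIS JSON schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# Review outcomes named in the text; the string spellings are assumptions.
ALLOWED_STATUSES = {"acknowledged", "false_positive", "escalated"}


@dataclass
class AlertEvidence:
    # All field names are placeholders, not the published schema.
    alert_id: str
    window_start_frame: int
    window_end_frame: int
    track_ids: List[int]
    weapon_class: Optional[str]
    weapon_confidence: Optional[float]
    action_label: str
    action_confidence: float
    review_status: str = "pending"


def review(alert: AlertEvidence, status: str) -> AlertEvidence:
    """Record the operator's decision on an alert."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown review status: {status}")
    alert.review_status = status
    return alert


# Made-up sample values covering one 30-frame window.
alert = AlertEvidence("a-001", 120, 149, [7, 12], "knife", 0.71, "assault", 0.83)
payload = json.dumps(asdict(review(alert, "escalated")), indent=2)
print(payload)
```

Keeping the record a flat, typed structure means the same payload can be logged, displayed, and audited without a second representation.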
06 What makes HARIS different
HARIS is not a monolithic detector wrapped in a dashboard. The system keeps named model boundaries and exposes the evidence chain to the operator.
Every decision is traceable to a named sub-model. When HARIS is wrong, we know which stage was wrong and can fix that stage without retraining the whole system.
Actions are classified over 30-frame skeleton windows. Single-frame detections never carry the whole decision.
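The windowing itself can be sketched as a rolling buffer over one person's per-frame skeletons. The 30-frame length comes from the text above. The 15-frame stride, the function name, and the type aliases are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Sequence, Tuple

Keypoint = Tuple[float, float]
Skeleton = Sequence[Keypoint]  # 17 keypoints for one person in one frame

WINDOW = 30  # frames per window, as stated in the text


def skeleton_windows(frames: Iterable[Skeleton],
                     window: int = WINDOW,
                     stride: int = 15) -> Iterator[List[Skeleton]]:
    """Yield overlapping fixed-length windows from one track's skeletons.

    The stride is an assumed value; the source does not state one.
    """
    buffer: Deque[Skeleton] = deque(maxlen=window)
    for i, skeleton in enumerate(frames):
        buffer.append(skeleton)
        start = i + 1 - window  # index of the first frame in the buffer
        if len(buffer) == window and start % stride == 0:
            yield list(buffer)


# 60 toy frames; the x value of each keypoint encodes the frame index.
frames = [[(float(i), 0.0)] * 17 for i in range(60)]
windows = list(skeleton_windows(frames))
print(len(windows))                   # 3 windows, starting at frames 0, 15, 30
print([w[0][0][0] for w in windows])  # [0.0, 15.0, 30.0]
```

A track shorter than 30 frames yields no window at all, which matches the idea that a single-frame detection never carries the whole decision.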
A clean JSON boundary means future mobile and desktop clients can plug into the same server for portable operator workflows.
Group-aware splits, source-level deduplication, and published limitations keep the results anchored to what the system can actually do.
07 Roadmap
Qwen2.5-VL summaries remain planned, while FaceNet re-ID is treated as an operator-gated live feature outside the published CORE pipeline.
Problem framing, literature review, initial dataset construction, first-pass detector and pose integration. Proposal defense passed.
Tier 0 motion gate, dual RT-DETR, BoT-SORT, RTMPose-L, ST-GCN, and Aggressor Logic Engine running end-to-end with auditable JSON output.
Fine-tuned detector, paper metrics, continuous body overlay, threshold controls, threat heatmap, holder binding, and night or tint modes.
On-alert visual-language summaries that explain scene context in natural language after a structured alert has already fired.
A temporal-smoothing variant that improves role assignment stability for aggressor, defender, and bystander labels.
Profiling and optimization for constrained deployment targets beyond the desktop GPU environment used in the paper.
08 Honest limitations
We publish the caps. A system that pretends to have no limits hides those limits from its operators, which is the opposite of what surveillance AI should do.
Tracking applies to everyone in frame, but skeleton-based action classification runs only on the four people with the highest detection confidence.
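The four-person cap amounts to a top-k selection over the tracked people. A minimal sketch, assuming each track is a dict with hypothetical `track_id` and `det_conf` keys:

```python
import heapq
from typing import Dict, List

MAX_ACTION_SUBJECTS = 4  # the cap stated in the text


def select_for_action(tracks: List[Dict], limit: int = MAX_ACTION_SUBJECTS) -> List[Dict]:
    """Keep the tracks with the highest detection confidence, best first."""
    return heapq.nlargest(limit, tracks, key=lambda t: t["det_conf"])


# Made-up confidences for six tracked people.
tracks = [
    {"track_id": 1, "det_conf": 0.91},
    {"track_id": 2, "det_conf": 0.42},
    {"track_id": 3, "det_conf": 0.88},
    {"track_id": 4, "det_conf": 0.67},
    {"track_id": 5, "det_conf": 0.95},
    {"track_id": 6, "det_conf": 0.30},
]
selected = select_for_action(tracks)
print([t["track_id"] for t in selected])  # [5, 1, 3, 4]
```

Tracks 2 and 6 in this example keep their boxes and IDs but receive no action label.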
The dashboard is scoped for short-clip operator workflows, with a 60-second, 10 FPS default upload cap that can be overridden for evaluation runs.
Tracker re-identification and watchlist matching are gated because the re-ID path adds wall-clock latency and carries privacy implications.
Far-field subjects with low-quality skeletons can keep bounding boxes and tracks while dropping action labels.
Knife boxes are harder to localize at small scale and in occlusion, which can affect holder binding and confidence.
Very small people, compressed footage, and poor camera angles still reduce detector, pose, and action-classifier reliability.
09 Team