What Happened
Researchers at the University of Pennsylvania are building a multi-robot triage system that uses Meta's open-source DINO and SAM 2 models to automatically detect and assess injuries during mass casualty incidents. The project, called PRONTO (Penn Robotic Non-contact Triage and Observation), is the team's entry in the three-year DARPA Triage Challenge.
The system works in layers. Drones survey a disaster scene from above to locate victims. Ground robots then move in for stable, close-up imaging and vital sign capture. SAM 2 segments objects in the drone and robot footage, including kinds of objects it never saw during training. Grounding DINO takes text prompts like "wound?" and "blood?" and detects matching injury-related features in image regions, with no task-specific labeled training data required.
The robots measure heart rate, respiration rate, awareness level, and visible injuries, then use body pose estimation and wound-to-skeletal comparison algorithms to assess severity. Phase 1 wrapped in 2024, Phase 2 testing ran from September 27 through October 4, and Phase 3 will use improved versions of both SAM and DINO.
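The article does not say how PRONTO's wound-to-skeletal comparison works, so the following is only a minimal sketch of the general idea under stated assumptions: a detected wound is matched to the nearest pose keypoint, and its size relative to the body is weighted by how critical that region is. The `Wound` type, the `REGION_WEIGHT` values, and both function names are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass
from math import hypot

# Hypothetical weights: how much concern a wound near each body region raises.
# These numbers are illustrative only, not clinical guidance.
REGION_WEIGHT = {"head": 1.0, "chest": 0.9, "abdomen": 0.8, "left_leg": 0.4, "right_leg": 0.4}


@dataclass(frozen=True)
class Wound:
    x: float        # wound centre, image coordinates (pixels)
    y: float
    area_px: float  # area of the segmented wound mask, in pixels


def nearest_region(wound: Wound, keypoints: dict) -> str:
    """Return the name of the pose keypoint closest to the wound centre."""
    return min(
        keypoints,
        key=lambda name: hypot(keypoints[name][0] - wound.x, keypoints[name][1] - wound.y),
    )


def severity_score(wound: Wound, keypoints: dict, body_area_px: float) -> float:
    """Weight the wound's share of the visible body area by its body region."""
    if body_area_px <= 0:
        raise ValueError("body_area_px must be positive")
    region = nearest_region(wound, keypoints)
    area_fraction = min(wound.area_px / body_area_px, 1.0)
    return REGION_WEIGHT[region] * area_fraction


# Example with made-up keypoints (name -> (x, y)) from a pose estimator.
keypoints = {"head": (50.0, 10.0), "chest": (50.0, 40.0), "left_leg": (40.0, 90.0)}
wound = Wound(x=52.0, y=38.0, area_px=200.0)
score = severity_score(wound, keypoints, body_area_px=4000.0)
```

A real system would presumably combine a score like this with the vital signs listed above rather than rely on image geometry alone.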
The team spans Penn Medicine, Penn Engineering, and the GRASP robotics lab.
Why It Matters
Mass casualty events (natural disasters, building collapses, large-scale accidents) overwhelm medical responders. The bottleneck isn't usually treatment capacity; it's figuring out who needs help first when you have dozens or hundreds of injured people in chaotic conditions with dust, darkness, and noise.
What's significant from an AI tools perspective is that DINO and SAM are general-purpose models being applied to an extremely specialized domain. DINO's open-vocabulary detection means it can identify injury markers from text prompts alone, without the months of collecting and labeling medical images that traditional computer vision systems would require. SAM 2, likewise, can segment objects it never saw during training.
This zero-shot capability is what makes foundation models different from the previous generation of AI tools. You don't need a medical imaging dataset. You need a good foundation model and smart prompting.
Our Take
The DARPA connection tells you something about how seriously the defense establishment takes open-source AI models. A team competing in a DARPA program is not building a proprietary vision system from scratch. It is composing Meta's open-source models with robotics and domain-specific logic.
For practitioners, the takeaway is about composability. DINO handles detection, SAM handles segmentation, and the application logic sits on top. This is the same pattern we see in commercial AI tool stacks: foundation models serve as building blocks, and the value is created in the integration layer.
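To make the composition pattern concrete, here is a minimal sketch, not PRONTO's actual code. The detector and segmenter sit behind small interfaces and the application logic only talks to those interfaces. Every name here (`Detector`, `Segmenter`, `assess`, the `Fake*` stand-ins, the `min_score` threshold) is hypothetical; in a real stack the stand-ins would be replaced by wrappers around an open-vocabulary detector and a promptable segmenter.

```python
from dataclasses import dataclass
from typing import Protocol

Box = tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels


@dataclass(frozen=True)
class Detection:
    prompt: str
    score: float
    box: Box


@dataclass(frozen=True)
class Finding:
    prompt: str
    score: float
    mask_area_px: int


class Detector(Protocol):
    def detect(self, image: object, prompts: list[str]) -> list[Detection]: ...


class Segmenter(Protocol):
    def segment(self, image: object, box: Box) -> int: ...  # mask area in pixels


def assess(image: object, detector: Detector, segmenter: Segmenter,
           prompts: list[str], min_score: float = 0.3) -> list[Finding]:
    """Integration layer: detect by text prompt, then segment each confident hit."""
    findings = []
    for det in detector.detect(image, prompts):
        if det.score < min_score:
            continue  # drop low-confidence detections before segmenting
        findings.append(Finding(det.prompt, det.score, segmenter.segment(image, det.box)))
    return sorted(findings, key=lambda f: f.score, reverse=True)


# Stand-ins with canned outputs so the sketch runs without any model weights.
class FakeDetector:
    def detect(self, image, prompts):
        canned = {
            "blood?": Detection("blood?", 0.8, (10, 10, 30, 20)),
            "wound?": Detection("wound?", 0.2, (0, 0, 5, 5)),
        }
        return [canned[p] for p in prompts if p in canned]


class FakeSegmenter:
    def segment(self, image, box):
        x0, y0, x1, y1 = box
        return (x1 - x0) * (y1 - y0)  # pretend the mask fills the whole box


findings = assess(None, FakeDetector(), FakeSegmenter(), ["wound?", "blood?"])
```

Because `assess` depends only on the two interfaces, either model can be swapped for a newer version without touching the triage logic, which is the point of the building-block pattern described above.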
The fact that text prompts like "blood?" can drive injury detection without training data is genuinely useful for rapid deployment. When the next disaster hits, you don't have time to collect and label a dataset. You need models that work out of the box with natural language instructions. That's where the field is heading, and PRONTO is an early proof point.