For medical applications, ground truth is typically ascertained through manual labelling by clinical experts. However, significant inter-observer variability and various human biases limit accuracy. A probabilistic framework addresses these issues by comparing aggregated human and automated labels to provide a reliable ground truth, requiring no prior knowledge of individual annotator performance. As an alternative to mean or median voting strategies, novel contextual features (signal quality and physiology) were introduced to allow the Probabilistic Label Aggregator (PLA) to weight each algorithm or human according to its estimated performance. As a proof of concept, the PLA was applied to estimation of the QT interval (a pro-arrhythmic indicator) from the electrocardiogram, using labels from 20 humans and 48 algorithms crowd-sourced from the PhysioNet/Computers in Cardiology Challenge 2006 database. For automated annotations, the root mean square error of the PLA was 13.97 ± 0.46 ms, significantly outperforming the best Challenge entry (16.36 ms) as well as the mean and median voting strategies (17.67 ± 0.56 ms and 14.44 ± 0.52 ms respectively, p < 0.05). When only three annotators were selected, the PLA improved annotation accuracy over median aggregation by 10.7% for human annotators and 14.4% for automated algorithms. The PLA could therefore provide an improved "gold standard" for medical annotation tasks, even when no ground truth is available.
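To make the aggregation idea concrete, the sketch below implements a generic precision-weighted EM scheme for fusing continuous-valued labels, in which each annotator's noise variance is estimated jointly with the fused label so that noisier annotators receive less weight. This is a minimal illustration of the weighting principle only, not the paper's PLA (which additionally conditions on contextual features such as signal quality); the function name `aggregate_labels`, the Gaussian noise model, and the toy data are assumptions introduced here for illustration.

```python
import numpy as np

def aggregate_labels(labels, n_iter=50, eps=1e-9):
    """EM-style fusion of continuous labels (a generic sketch, not the PLA).

    labels : (n_samples, n_annotators) array, e.g. QT estimates in ms.
    Returns the fused estimates and per-annotator noise variances.
    """
    # Initialise the fused estimate with the median (a robust starting point).
    y = np.median(labels, axis=1)
    for _ in range(n_iter):
        # M-step: each annotator's noise variance around the current estimate.
        var = np.mean((labels - y[:, None]) ** 2, axis=0) + eps
        # E-step: precision-weighted average; noisier annotators get less weight.
        w = 1.0 / var
        y = (labels * w).sum(axis=1) / w.sum()
    return y, var

# Toy usage: 200 beats annotated by 5 annotators with differing noise levels.
rng = np.random.default_rng(0)
true_qt = rng.normal(400, 20, size=200)             # synthetic "true" QT in ms
noise_sd = np.array([5.0, 10.0, 15.0, 25.0, 40.0])  # per-annotator noise (ms)
obs = true_qt[:, None] + rng.normal(0, noise_sd, size=(200, 5))

est, var = aggregate_labels(obs)
print("RMSE, median voting:", np.sqrt(np.mean((np.median(obs, axis=1) - true_qt) ** 2)))
print("RMSE, weighted EM  :", np.sqrt(np.mean((est - true_qt) ** 2)))
```

On synthetic data of this kind, the precision-weighted estimate typically beats the unweighted median because it down-weights the high-variance annotators, which mirrors the abstract's finding that performance-aware aggregation outperforms mean and median voting.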