pH is a crucial physicochemical property that affects protein structure, folding, stability, and function, largely by setting the protonation states of titratable residues, which are governed by their pKa values. Many computational methods have been developed to calculate these pKa values. In the highly accurate, but slow, Poisson–Boltzmann (PB)-based methods, proteins are represented by point charges in a low-dielectric medium surrounded by an implicit solvent (high dielectric). Empirical methods rely on statistically fitting parameters to large datasets of experimental pKa values. They are much faster than the physics-based methods, but offer less microscopic insight and have unknown predictive power for mutations and for proteins dissimilar to those in the training set.
Here, I will present a novel strategy to combine the best features of PB models – accuracy and interpretability – with the speed of classical empirical methods. The resulting deep learning pKa predictors were trained on a database of 3M theoretical pKa values estimated from 50k structures using a PB method. With this approach, the physics-based predictions can be reproduced with an average error below 0.4 pK units while being up to 1000x faster.
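
To make the link between a predicted pKa and pH-dependent behaviour concrete, the following minimal Python sketch applies the Henderson-Hasselbalch relation to a single titratable site. It is an illustration only and is not part of the method presented here: the function name `protonated_fraction` is hypothetical, and the site is assumed to titrate independently of all others, which is a simplification for residues in a real protein.

```python
def protonated_fraction(pka: float, ph: float) -> float:
    """Fraction of an isolated titratable site that is protonated at a given pH.

    Uses the Henderson-Hasselbalch relation for a single site, assuming it
    does not interact with any other titratable group (illustrative only).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    # Hypothetical example values: a site with pKa 4.0 evaluated at several pH values.
    for ph in (2.0, 4.0, 7.0):
        print(f"pH {ph}: protonated fraction = {protonated_fraction(4.0, ph):.3f}")
```

When the pH equals the pKa, the site is half protonated, so an error in a predicted pKa shifts the estimated protonation most strongly for sites whose pKa lies close to the pH of interest.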