Data-Independent Acquisition (DIA) strategies are an integral part of proteomics studies involving large cohorts. After acquisition, DIA-MS data is traditionally analysed using spectral reference libraries (SRLs) created from separate Data-Dependent Acquisition (DDA) experiments. In contrast, the utility of SRLs generated using DIA rather than DDA has not been extensively evaluated. Machine learning tools for processing DIA data, such as DIA-NN, have recently emerged and open up new options for library generation and searching. Our aim was to compare the performance of SRLs derived from either DIA (via DIA-NN) or DDA acquisition modes.
We used 1,261 fresh frozen cancer samples encompassing 73 cancer types from 27 tissue types. Samples were acquired on three TripleTOF 6600 MS instruments in technical triplicate (one run per instrument). Samples were grouped based on histopathology and combined to produce 39 separate pools. Three approaches were used to search the data. First, individual sample DIA files were searched using DIA-NN (“SRL-free”). Alternatively, sample pools were fractionated using high-pH RP-HPLC (15 fractions), after which data was acquired in either DIA or DDA mode (39 × 15 × 2 runs; “DIA-SRLs” and “DDA-SRLs”, respectively). The conventional DDA-SRLs were produced using ProteinPilot/PeakView, while DIA-SRLs were produced using DIA-NN v1.8.
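As a rough illustration of the acquisition effort implied by this design, the run counts can be worked out from the figures above. This is only arithmetic on the stated numbers; the variable names are ours, and it assumes every sample and every fraction was acquired exactly as described.

```python
# Figures taken from the study design described above.
n_samples = 1261      # fresh frozen cancer samples
n_replicates = 3      # technical triplicate, one run per instrument
n_pools = 39          # histopathology-based pools
n_fractions = 15      # high-pH RP-HPLC fractions per pool
n_modes = 2           # DIA and DDA

# Individual-sample DIA runs used by all three search approaches.
sample_runs = n_samples * n_replicates

# Additional fractionated runs needed only to build the DIA-SRLs and DDA-SRLs.
library_runs = n_pools * n_fractions * n_modes

print(f"sample runs:  {sample_runs}")
print(f"library runs: {library_runs}")
```

The second figure is the extra instrument time that library building requires and that the “SRL-free” approach does not.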
In a lung cancer squamous pool, the “DIA-SRL”, “DDA-SRL”, and “SRL-free” approaches identified 9451, 6118, and 8491 proteins, respectively. The average protein overlap of the three approaches was 60%. DIA-SRLs improved the number of proteins identified by 40% compared to the other two approaches; however, around 50% of values were missing when the data was searched. DDA-SRLs produced the lowest number of identifications and 84% missing values. The SRL-free approach had the lowest missingness, at 35%. Analysis of the proteins uniquely identified by each approach suggests that the missing values can be attributed to proteins identified using the high-pH fractionation methods.
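The two summary metrics reported here, missingness and protein overlap, can be sketched as follows. This is a minimal sketch rather than the analysis actually performed: it assumes missingness is the fraction of empty cells in a protein-by-run quantification table and that overlap is the mean pairwise Jaccard index between identification sets, neither of which is specified in this abstract. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def missingness(matrix):
    """Fraction of missing (None) cells in a protein-by-run table."""
    cells = [value for row in matrix.values() for value in row]
    return sum(value is None for value in cells) / len(cells)


def mean_pairwise_overlap(id_sets):
    """Mean Jaccard index over all pairs of identification sets (assumed definition)."""
    pairs = list(combinations(id_sets.values(), 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical toy data, not results from the study.
toy_ids = {
    "DIA-SRL": {"P1", "P2", "P3", "P4"},
    "DDA-SRL": {"P1", "P2"},
    "SRL-free": {"P1", "P2", "P3"},
}
toy_matrix = {
    "P1": [10.0, 12.0, 11.0],
    "P2": [None, 8.0, None],
    "P3": [None, None, 5.0],
    "P4": [7.0, None, 6.0],
}

print(f"missingness: {missingness(toy_matrix):.1%}")
print(f"mean pairwise overlap: {mean_pairwise_overlap(toy_ids):.1%}")
```

A different overlap definition, such as the proteins shared by all three approaches divided by their union, would give a different figure for the same data, so the definition should be stated alongside any reported percentage.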
Conventional DDA-SRLs performed poorly in comparison with the other two methods used in this study. While pooled high-pH fractionation of complex samples is a common strategy for SRL generation that offers improved protein identification, our data shows that it comes at the cost of a high proportion of missing values and low-intensity peptides, which has implications for the usefulness of fractionation. Overall, our “SRL-free” approach using DIA-NN comprehensively describes large-scale clinical cohorts and offers a faster alternative for proteomic analysis where sample availability may be limited.