Missing With(out) a Trace

A primer on missing values in quantitative mass spectrometry-based proteomics

Egor Vorontsov
11 min read · Mar 29, 2021
Photo by Sigmund on Unsplash

Proteomics is the large-scale study of proteins in a cell, tissue or organism. The performance of proteomic analysis has grown tremendously over the years, largely powered by mass spectrometry (MS). At this point, nobody is surprised to see quantitative changes of thousands of proteins measured across hundreds of samples using MS. A less exciting “feature” of MS-based proteomics is the presence of missing values (MVs) in the quantitative data. A caveat of producing ever-growing data sets is that missing values become more prevalent with larger study size! In this post, I will give a short overview of the origins of MVs in proteomic experiments, aimed at researchers who need to make sense of MS-based proteomic data but do not necessarily have prior experience with MS. This post will only focus on well-established experiment types and will not try to cover the wide variety of novel approaches that pop up and develop all the time. Neither will I go into MV imputation strategies, as that would require a dedicated (long) post. If you would like to dive into the publications on MV imputation in proteomic data, I can suggest [1, 2, 3] as great starting points. Now, without further delay, let’s get to the subject! I will try to explain how the popular MS-proteomic methods work and then discuss how MVs occur in them.

How does identification happen in MS-based proteomics?

Proteins in the living cell come in all shapes and sizes in order to fulfill their diverse functions. This extraordinarily wide range of chemical and physical characteristics makes protein analysis a very tall order from the MS standpoint. Cutting proteins into smaller and more uniformly sized pieces (peptides) greatly facilitates sample handling and MS analysis. The star here is the intestinal digestive enzyme trypsin, which cuts proteins on the C-terminal side of lysine and arginine residues. Analyzing the digested peptides to infer knowledge about the original proteins is called bottom-up proteomics. The resulting mixture of digested peptides is typically way too complex to be analyzed at once, so at least one separation step is required, usually liquid chromatography, hence the term LC-MS. A typical workflow is shown below:

LC-MS proteomic workflow. Modified from the original image by Philippe Hupé, which is available under CC BY-SA 3.0
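To make the digestion step a bit more tangible, here is a minimal sketch of an in silico tryptic digest. It assumes the simplest possible rule (cut after every K or R); real search engines usually apply extra rules, such as not cutting before proline and allowing missed cleavages. The function name and the example sequence are made up for illustration.

```python
def digest_trypsin(sequence):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            # Cut on the C-terminal side of lysine (K) or arginine (R).
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # The C-terminal peptide of the protein may not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical sequence, only to show the behavior of the rule.
example_peptides = digest_trypsin("MKWVTFISLLRSAYSK")
print(example_peptides)
```

Each protein ends up as a handful of peptides, and it is these peptides, not the intact proteins, that the mass spectrometer sees.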

The digested peptide mixture is separated over time; peptides exit the chromatographic column and acquire an electric charge, turning into ions. The mass spectrometer records a snapshot of the mass-to-charge values (m/z) for the peptide ions that come up at a given moment. We can then select peptide ions of a certain m/z, pass on additional energy and observe as they disintegrate into fragments. Peptide ions are also called precursor ions in this context, as opposed to the fragment ions that are generated during the fragmentation stage of the experiment. A combination of the precursor mass with the spectrum of fragment ions is often sufficient to identify the corresponding peptide. The precursor and fragment mass spectra act in tandem, hence the common term tandem mass spectrometry, which is sometimes also written as MSMS or MS/MS. An LC-MS system can automatically iterate through peptide ions, select them one by one, produce fragment ions, record spectra and exclude the analyzed ions from further selection. This is called global, or shotgun, proteomics, and it delivers a deep explorative analysis of a sample, even if its content is unknown beforehand.

Data-dependent acquisition (DDA) tandem LC-MS experiment with a precursor spectrum and fragmentation spectra of the 3 most abundant precursors. Image by author, re-use with attribution according to CC BY 4.0

Selection of peptides is rule-based, but it ultimately depends on which peptides show up at a given moment during the analysis, hence the name data-dependent acquisition (DDA). Being influenced by the sample composition and by slight changes in experimental conditions, the outcome of the analysis has an element of stochasticity. The most abundant peptides will be selected for identification with near certainty, while low-abundance peptides can slip through unselected.
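The selection logic can be illustrated with a toy "top N" rule: in every precursor spectrum, pick the N most intense ions that have not been fragmented yet. This is only a sketch under simplifying assumptions (exact m/z matching, an exclusion list that never expires, invented m/z values and intensities); actual instrument methods are more elaborate.

```python
def select_top_n(survey_scan, n, excluded):
    """Pick the n most intense precursors that are not on the exclusion list.

    survey_scan: dict mapping precursor m/z -> intensity for one precursor spectrum.
    excluded: set of m/z values that were already fragmented (updated in place).
    """
    candidates = [
        (intensity, mz) for mz, intensity in survey_scan.items() if mz not in excluded
    ]
    candidates.sort(reverse=True)  # most intense first
    selected = [mz for _, mz in candidates[:n]]
    excluded.update(selected)
    return selected


# Two consecutive precursor spectra with invented values.
# The weak ion at m/z 701.40 is only visible in the first one.
scan_1 = {445.12: 9e6, 512.30: 4e6, 623.84: 2e6, 701.40: 5e4}
scan_2 = {445.12: 8e6, 512.30: 5e6, 623.84: 1e6, 688.35: 3e6}

excluded = set()
first = select_top_n(scan_1, 3, excluded)
second = select_top_n(scan_2, 3, excluded)
print(first, second)
```

In this toy run the ion at m/z 701.40 is outcompeted in the first spectrum and is gone by the second one, so it never gets a fragmentation spectrum, even though it was present in the sample.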

You probably see where this is going: a peptide can remain unidentified even if it is present in the mixture. In fact, even if we analyze the same sample twice on the same mass spectrometer, we will get slightly different lists of identified peptides. There are more deterministic approaches to global proteomic analysis, like data-independent acquisition (DIA), where a mass spectrometer systematically records fragmentation spectra at predefined intervals along the LC-MS run. There’s also targeted proteomics with synthetic peptide standards, which provides even more reliability by mixing the biological sample with known amounts of the synthesized peptides that correspond to the target proteins. These approaches have been actively developing in recent years, but if you get a global proteomic data set to work with in 2021, the chances are it will be the good old DDA.

We have looked at the identification of peptides, now it’s time to get quantitative.

Quantification on precursor ions

It would be convenient to calculate the amount of a peptide simply based on the strength of its signal in a mass spectrometer! Unfortunately, things don’t work that way, at least for now, as the same amount of a peptide can produce different signal intensity depending on environmental factors and on other components in the mixture. In addition, different peptides can yield very dissimilar signals at the same concentration! But if we keep the conditions controlled and reproducible enough, we can derive quantitative information based on the relative signal intensity of a peptide across samples. Quantification based on the relative abundance of peptide (precursor) ions between samples is often called label-free quantification in the context of LC-MS proteomics, as opposed to the label-based methods that we will discuss later.

Label-free quantification experiment with 3 samples, 3 LC-MS files and 5 precursor ions/peptides. The shape and exaggerated width of the chromatographic peaks do not represent experimental measurement. Image by author, re-use with attribution according to CC BY 4.0

When LC-MS files are ready, we need to convert the mass spectra into protein abundance data. Generally speaking, the processing can be summarized into the following stages:

  1. Find ions with peptide-like masses (precursor ions) and quantify their signal intensities over time. Find maximum signal or the area under the curve for each precursor.
  2. Link precursors between samples based on mass and chromatographic elution time.
  3. Use fragmentation spectra to identify peptides.
  4. Link the quantified signal intensities to peptide IDs based on accurate mass and chromatographic elution time; there will likely be many signals left without an ID.
  5. Quantify proteins by summing up the signals for the corresponding peptides.
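As a rough illustration of the last stage, the sketch below sums peptide intensities into protein intensities, using None for a missing value. The peptide names, the peptide-to-protein mapping and the numbers are invented, and real software typically handles shared peptides and normalization as well.

```python
def sum_to_proteins(peptide_table, peptide_to_protein):
    """Sum peptide intensities per protein and sample; None marks a missing value."""
    proteins = {}
    for peptide, intensities in peptide_table.items():
        protein = peptide_to_protein[peptide]
        # Start with "missing everywhere" and fill in whatever has been measured.
        totals = proteins.setdefault(protein, [None] * len(intensities))
        for i, value in enumerate(intensities):
            if value is not None:
                totals[i] = (totals[i] or 0.0) + value
    return proteins


# Invented example: three samples, three peptides, two proteins.
peptide_table = {
    "PEPA": [1e6, None, 2e6],
    "PEPB": [3e6, 4e6, None],
    "PEPC": [None, 5e5, None],
}
peptide_to_protein = {"PEPA": "P1", "PEPB": "P1", "PEPC": "P2"}

protein_table = sum_to_proteins(peptide_table, peptide_to_protein)
print(protein_table)
```

Protein P1 gets a value in every sample because its two peptides cover each other's gaps, while P2, represented by a single peptide, inherits all of that peptide's missing values.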

Crucially, quantification and identification happen at different stages of the process. This leads to uncertainty as to why there’s a missing value at a particular position. The reason can be that:

  • the corresponding quantitative signal is missing because the peptide is absent or below the quantification threshold in some of the samples,

or because:

  • a peptide ID has not been assigned to an otherwise valid quantitative signal.
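The second scenario is easy to reproduce in a toy version of the linking step. The sketch below attaches a peptide ID to a quantified signal when their m/z and elution time agree within a tolerance. The tolerances, the tuple layout and the values are assumptions made for this example; real tools usually work with ppm mass tolerances and aligned retention times.

```python
def annotate_features(features, identifications, mz_tol=0.01, rt_tol=0.5):
    """Attach a peptide ID to each quantified feature if m/z and elution time agree.

    features: list of (mz, retention_time_min, intensity) tuples.
    identifications: list of (mz, retention_time_min, peptide_sequence) tuples.
    Returns a list of (peptide_sequence or None, intensity) tuples.
    """
    annotated = []
    for mz, rt, intensity in features:
        peptide = None
        for id_mz, id_rt, sequence in identifications:
            if abs(mz - id_mz) <= mz_tol and abs(rt - id_rt) <= rt_tol:
                peptide = sequence
                break
        annotated.append((peptide, intensity))
    return annotated


# Invented values: two quantified signals, but only one identification.
features = [(500.27, 30.1, 1e6), (650.33, 45.0, 2e5)]
identifications = [(500.271, 30.3, "LVNELTEFAK")]

annotated = annotate_features(features, identifications)
print(annotated)
```

The second signal has a perfectly valid intensity, but without an ID it cannot be assigned to any peptide, so it turns into a missing value in the final table.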

Identification can be unsuccessful if the recorded fragmentation spectra are of insufficient quality, or if a precursor has not been fragmented at all. Ease of identification depends on the individual properties of a peptide, and it also drops with the weakening of the peptide signal. What’s more, there’s a higher chance for a peptide with low signal intensity to be completely missed by the data-dependent selection. This leads to an inverse relationship between the peptide abundance and the number of missing values per peptide across samples, as you can see below:

Median peptide signal intensity by the number of missing values. Summarized for a label-free study with triplicate analysis for each biological sample, which means that exactly the same peptides were present in all 3 runs. Image by author, data pulled from the project PXD004179 in PRIDE data archive.
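A summary like the one in the figure can be computed in a few lines. The sketch below groups peptides by their number of missing values and reports the median intensity per group. The table is invented and is not the PXD004179 data; the function name is hypothetical.

```python
from statistics import median


def median_intensity_by_mv_count(peptide_table):
    """Group peptides by their number of missing values (None) and
    return the median of the per-peptide median intensities for each group."""
    groups = {}
    for intensities in peptide_table.values():
        observed = [value for value in intensities if value is not None]
        if not observed:
            continue  # never quantified, so there is no intensity to summarize
        n_missing = len(intensities) - len(observed)
        groups.setdefault(n_missing, []).append(median(observed))
    return {n: median(values) for n, values in sorted(groups.items())}


# Invented triplicate measurements for four peptides.
triplicate_table = {
    "A": [8e6, 9e6, 1e7],
    "B": [5e6, 6e6, 7e6],
    "C": [4e5, None, 6e5],
    "D": [None, 1e5, None],
}

summary = median_intensity_by_mv_count(triplicate_table)
print(summary)
```

With these made-up numbers the median intensity drops as the number of missing values goes up, which is the same qualitative trend as in the figure.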

Identification can also be unsuccessful if the precursor mass has been determined incorrectly. The tendency for erroneous mass measurement can depend on the instrument type and condition, on the firmware, and likely on peptide abundance and mass, so it is quite difficult to propose a generalized model for such MVs. But it may be worth exploring for a particular kind of data that you are interested in, especially in the case of peptide-centric studies, such as phosphorylation analysis. Improvements in the MS technology and post-processing algorithms [4] can help to tackle this source of MVs.

Proteins often map to several quantified peptides, which helps to reduce the frequency of MVs, as you can see on the plot above. However, many proteins are represented by a single peptide, so the peptide-level MV patterns remain relevant for proteins.

Quantification by SILAC

Stable Isotope Labeling by Amino acids in Cell culture (SILAC) is a quantification method that is also based on precursor intensities, but with a clever twist. While preparing a biological experiment, for example, growing cells, we can add stable isotope-labeled amino acids to the culturing medium in one of the preparations, which yields proteins that contain heavy isotopes of carbon and nitrogen. Another cell culture can be grown in parallel using the nutrients containing the light isotopes. If the “light” and “heavy” cell preparations are then mixed, processed and analyzed by LC-MS, peptides will come in two isotopic versions, giving two precursor ions with different masses. The intensity ratio between the respective pairs of precursor ions will inform us about the ratio between the amounts of the peptide in two samples:

SILAC experiment with 4 samples and 2 types of cell culturing conditions: containing light isotopes of carbon and nitrogen, and isotopically labeled, or “heavy”. The shape and exaggerated width of the chromatographic peaks do not represent experimental measurement. Image by author, re-use with attribution according to CC BY 4.0
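In code, the SILAC readout boils down to a ratio of two precursor intensities. A minimal sketch, assuming that None stands for a signal that has not been detected and that the ratio is reported on a log2 scale (a common convention, but a choice made here for illustration):

```python
import math


def silac_log2_ratio(light, heavy):
    """Return log2(heavy / light), or None if either precursor signal is missing."""
    if light is None or heavy is None or light <= 0 or heavy <= 0:
        return None
    return math.log2(heavy / light)


# Invented intensities: the heavy form is four times more intense than the light one.
print(silac_log2_ratio(1e6, 4e6))
# One of the two isotopic forms was not detected, so no ratio can be reported.
print(silac_log2_ratio(None, 4e6))
```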

Quantification and identification are separated in SILAC, just like in the label-free method, so it does happen that signals are detected but no valid identity is assigned to them. If an ID is missing, it is normally missing for both isotopic versions of the peptide simultaneously. We can expect the frequency of MVs to be inversely related to peptide abundance. Note that this concerns the absolute signal: an abundance ratio can be high or low and still be missing from the results because the absolute signal intensity is low.

Isobaric labeling-based MS proteomics

Another popular approach for relative quantification is based on chemical modification of peptides with isotopically substituted reagents, also called isobaric labeling tags. It is widely used in biological and medical research, and recently it has also been applied to single-cell (SC) proteomics [5], so if you get an MS-based SC-proteomic data set, there’s a fair chance it will be based on isobaric labeling. For a more detailed understanding of the technology, I refer you to an open access review by Rauniyar and Yates [6].

Isobaric tags are a series of chemical reagents that can attach to peptides and contain heavy isotopes of carbon and nitrogen. They are designed in such a way that all reagents in the set, currently up to 16, have an equal total number of heavy isotopes of carbon and nitrogen. As a result, the peptides have the same precursor mass after modification, regardless of which of the 16 reagents was used; however, each of the reagents produces a reporter ion with a unique mass in the fragment ion spectrum. Now, if we take a bunch of samples, digest them into peptides, label each sample with its own reagent from the set, then mix the labeled samples (into one labeled set/plex/batch) and analyze them using LC-MS, we will get one precursor mass per peptide for all the mixed samples, and the fragmentation spectrum will be relevant to all of the mixed samples in the plex at once. But we can deduce the relative amounts of a peptide across samples by looking at the intensities of the specific reporter ions in the low m/z region of the fragmentation spectrum:

Isobaric labeling proteomic workflow with 4 unique reagents in a set and 7 different biological samples combined into 2 labeling batches (plexes). The shape and exaggerated width of the chromatographic peaks do not represent experimental measurement. Image by author, re-use with attribution according to CC BY 4.0

Once again, identification and quantification are separated, this time occurring in different parts of the fragmentation spectrum, which means that there can also be quantitative values without IDs. However, the isobaric labeling strategy has its own peculiarities:

  • The exact time at which a fragmentation spectrum is acquired has an element of stochasticity, see t1 and t2 in the figure, so the reporter intensities per se can vary from run to run, even if the amount of the peptide stays the same. It is the profile of the relative intensities within the plex that matters. This also warrants the need for a proper normalization within each batch for each protein. It can be, for example, normalization on a common sample that is present in all plexes across the study.
  • Since the identification happens for the whole batch at once, a missing ID will be simultaneously missing for all of the samples in the same set. Regardless of the actual relative intensities within the plex, all the values will become MVs without an ID.
  • In theory, a value could be missing within a batch when a protein is absent in a sample. In practice, due to isotopic impurities in the reagents, there’s often a weak noise signal, even if it is known that the particular protein is completely absent from the sample.
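The first two points can be put into a short sketch. It assumes one reference channel per plex that carries the common sample, divides every reporter intensity by that reference, and turns the whole plex into missing values when there is no peptide ID. The channel labels, function names and intensities are made up for illustration.

```python
def normalize_to_reference(reporters, reference_channel):
    """Divide each reporter ion intensity by the reference channel of the same plex."""
    reference = reporters[reference_channel]
    return {channel: intensity / reference for channel, intensity in reporters.items()}


def plex_values(peptide_id, reporters, reference_channel):
    """Return normalized values for one spectrum, or all-missing if there is no ID."""
    if peptide_id is None:
        # Without an ID the whole plex becomes missing, whatever the intensities were.
        return {channel: None for channel in reporters}
    return normalize_to_reference(reporters, reference_channel)


# The same peptide fragmented at two different time points (invented intensities):
# the absolute reporter signals differ threefold, the relative profile does not.
spectrum_t1 = {"126": 2e5, "127": 4e5, "128": 1e5}
spectrum_t2 = {"126": 6e5, "127": 1.2e6, "128": 3e5}

profile_t1 = plex_values("PEPTIDEK", spectrum_t1, "126")
profile_t2 = plex_values("PEPTIDEK", spectrum_t2, "126")
unidentified = plex_values(None, spectrum_t1, "126")
print(profile_t1, profile_t2, unidentified)
```

Both time points give the same normalized profile, which is why the within-plex profile is the quantity to compare, while the unidentified spectrum contributes nothing but MVs for every sample in the plex.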

This means that the number of MVs per peptide or protein often doesn’t take on all possible values based on the number of samples. For example, in a data set consisting of two 9-plexes and three 8-plexes, each peptide/protein has a number of MVs that is a sum of eights and nines:

Missing values per peptide and per protein in an isobaric labeling-based study (log-scale on x-axis). Forty two samples arranged into 5 plexes, 2 x 9 and 3 x 8 samples each. Image by author. Data pulled from the project PXD021218 in PRIDE data archive.
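The arithmetic behind this pattern is easy to check. Under the idealized assumption that a peptide is either quantified in a whole plex or missing in a whole plex, the possible MV counts are the sums over all subsets of plex sizes:

```python
from itertools import product


def possible_mv_counts(plex_sizes):
    """All totals obtainable when each plex is either fully present or fully missing."""
    totals = set()
    for missing_flags in product((0, 1), repeat=len(plex_sizes)):
        totals.add(sum(size for size, flag in zip(plex_sizes, missing_flags) if flag))
    return sorted(totals)


# Two 9-plexes and three 8-plexes, 42 samples in total.
counts = possible_mv_counts([9, 9, 8, 8, 8])
print(counts)  # [0, 8, 9, 16, 17, 18, 24, 25, 26, 33, 34, 42]
```

That is only 12 out of the 43 values between 0 and 42, which explains the comb-like shape of the distribution.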

As in other proteomic approaches discussed above, each protein might map to a whole bunch of identified peptides, which somewhat reduces the prevalence of MVs on the protein level, see the plot above. However, many proteins are only represented by 1 or 2 peptides, so the protein-level data still contains plenty of MVs.

The frequency of MVs presents a challenge for large studies: when we move into the realm of clinical proteomics with hundreds of patient samples, or into SC proteomics with thousands of cells, valid values in DDA analysis become more and more scarce. To add insult to injury, we are often interested in proteins with relatively low abundances that have interesting biological functions, and as we have seen above, lower abundance means more missing values. This justifies further development of other MS acquisition modes that produce fewer MVs by design, like the aforementioned DIA, or wider use of data analysis approaches that tolerate missing values.

Conclusions

We have looked at the popular MS approaches for relative protein quantification: label-free precursor quantification, SILAC and isobaric labeling-based quantification. We have noted that identification and quantification of peptides do not occur simultaneously. Thus, missing values in a proteomic data set may occur when the corresponding peptide is actually missing, or when it is present but unidentified. The propensity for missing values seems to be inversely related to the abundance of the particular peptide. Finally, we have mentioned that missing values in isobaric labeling-based proteomics mostly occur for the whole labeling batch at once, transforming both low and high true relative intensities into missing values.

References

[1] C. Lazar et al. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies (2016), J. Proteome Res. 15, 4, 1116–1125. Open access under CC-BY license.

[2] L.M. Bramer et al. A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun Proteomics (2021), J. Proteome Res. 20, 1, 1–13.

[3] W. Ma et al. DreamAI: algorithm for the imputation of proteomics data (2020), bioRxiv. Open access under CC-BY-NC-ND 4.0 International license.

[4] R. Rad et al. Improved Monoisotopic Mass Estimation for Deeper Proteome Coverage (2021), J. Proteome Res., 20, 1, 591–598.

[5] B. Budnik, E. Levy, G. Harmange et al. SCoPE-MS: mass spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation (2018), Genome Biol 19, 161. Open access.

[6] N. Rauniyar and J. R. Yates, III. Isobaric Labeling-Based Relative Quantification in Shotgun Proteomics (2014), J. Proteome Res., 13, 12, 5293–5309. Open access.
