--- title: "Input Preparation Workflow" author: "Dina Schuster" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Input Preparation Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction This vignette will give you an overview of how you can prepare the quantitative protein/peptide matrix output from common search engines and software such as [Spectronaut](https://biognosys.com/software/spectronaut/), MaxQuant, [Proteome Discoverer](https://www.thermofisher.com/order/catalog/product/OPTON-30945) and [Skyline](https://skyline.ms/project/home/software/Skyline/begin.view) for the analysis with **protti**. Due to its modular and flexible structure **protti** can be used on the output of common bottom-up proteomics search engines irrespective of the measurement mode (DDA, DIA, targeted-MS). Furthermore, you are not only restricted to reports from the above mentioned search engines. As long as your data has a tabular format (data frame) and a specific minimal number of data columns you can analyse it with **protti**. The columns minimally required contain information on sample, condition, intensity, protein ID and the level the intensity is based on (fragment, precursor, peptide) if different from protein intensity. Depending on the analysis many more columns can be useful, but they are not required. Ultimately, your data should have a structure similar to this: | Sample | Protein ID | Peptide Sequence | Condition | Intensity | |:-------:|:-----------:|:-----------------:|:----------:|:---------:| | sample1 | P62942 | PEPTIDER | treated | 14000 | | sample2 | P62942 | PEPTI | treated | 15000 | | sample3 | P62942 | PEPTIDE | treated | 14500 | | sample4 | P62942 | PEPTIDER | control | 18000 | | sample5 | P62942 | PEPTI | control | 21000 | | sample6 | P62942 | PEPTIDE | control | 19000 | It is very important, that each unit of the level you perform your analysis on (e.g. peptide) has a single unique intensity associated with it. If, for example, a peptide has two different intensities, **protti** would not know how to deal with this and many functions will likely fail. Data should always be organised in a format called [tidy data](https://r4ds.had.co.nz/tidy-data.html). That means data should be contained in a long format (e.g. all sample names in one column) rather than a wide format (e.g. each sample name in its own column with intensity as the content of the columns). You can easily achieve this by using the `pivot_longer()` function from the [`tidyr`](https://tidyr.tidyverse.org/index.html) package. The output of many search engines already contains tidy data and working with it is very easy because you can refer to information with only one variable. **protti** is designed to work together well with the [`tidyverse`](https://www.tidyverse.org) package family that is build around the concept of tidy data. ## Protein-centric analysis Many search engines provide the user with protein intensities. However, it is also possible to calculate protein intensities directly from precursor intensities with the **protti** function `calculate_protein_abundance()`. **Protti** implements the `"iq"` method, previously implemented in the R package [`iq`](https://doi.org/10.1093/bioinformatics/btz961) which performs protein quantification based on the maximal peptide ratio extraction algorithm adapted from the MaxLFQ algorithm ([Cox, J. 2013](https://doi.org/10.1074/mcp.M113.031591)). One advantage of calculating the protein abundance with **protti** is the possibility to median normalise run intensities on the precursor level. This is closer to the actually acquired intensities and thus sample concentrations than if normalisation is performed on the protein level. Some search engines provide the option for automatic median normalisation but not all. Furthermore, some search engines calculate protein intensities by summation of precursor intensities irrespective of missingness of peptides in certain samples. In these cases the maximal peptide ratio implemented in extraction algorithm provides a more robust calculation of protein intensities. If you prefer to use protein intensities provided by the seach engine of your choice this is not a problem and we will show how some of this information can be converted into the right format. ## Loading packages We will demonstrate how most outputs can be converted with functions from the R packages [`magrittr`](https://magrittr.tidyverse.org/index.html), [`dplyr`](https://dplyr.tidyverse.org/index.html), [`tidyr`](https://tidyr.tidyverse.org/index.html) and [`stringr`](https://stringr.tidyverse.org/). You can load packages after you installed them with the `library()` function. ```{r setup, message = FALSE, warning = FALSE} library(magrittr) library(dplyr) library(tidyr) library(stringr) ``` Note that we are using the R package [`magrittr`](https://magrittr.tidyverse.org/index.html) because of its pipe operator `%>%`. It takes the output of the preceding function and supplies it as the first argument of the following function. Using `%>%` makes code easier to read and follow. # Spectronaut Spectronaut reports already contain data in the tidy data format. Therefore nothing needs to be changed in order to use them with **protti**. However, the columns we would recommend (not all columns are required) to export from Spectronaut are: * R.Condition (condition names) * R.FileName (file names) * PG.ProteinAccessions (protein identifiers) * PEP.IsProteotypic (logical indicating if peptide is proteotypic) * PEP.StrippedSequence (peptide sequence) * PEP.NrOfMissedCleavages (number of missed cleavages, relevant for quality control) * EG.IsDecoy (logical indicating if peptide is a decoy match) * EG.PrecursorId (peptide precursor ID) * FG.Quantity (precursor quantity, required for peptide-centric analyses) * FG.Charge (precursor charge state, relevant for quality control) * PG.Quantity (protein group quantity, required for protein-centric analyses if you prefer not to calculate them with `calculate_protein_abundance()`) Please make sure that the report is a .csv file. You can use the `read_protti()` function in order to load the spectronaut report into R. This function is a wrapper around the fast `fread()` function from the [`data.table`](https://rdatatable.gitlab.io/data.table/) package and the `clean_names()` function from the [`janitor`](https://sfirke.github.io/janitor/) package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case. This will make it easier to remember them and to use them in your data analysis. For the Spectronaut columns `R.FileName` will change for example into `r_file_name`. ```{r Spectronaut, eval=FALSE} # To read in your own data you can use read_protti() spectronaut_data <- read_protti(filename = "mydata/spectronaut.csv") ``` # MaxQuant Depending on which analysis you are performing you will have to use different outputs. For peptide-centric analyses we would recommend to use the `evidence.txt` file. If you want to perform a protein-centric analysis and you want to use protein quantities calculated by MaxQuant, you need the `proteinGroups.txt` file. However, you can also apply the maximal peptide ratio extraction algorithm from the `iq` R package implemented in the `protein_abundance_calculation()` function of **protti**. This allows you to only use the `evidence.txt` file. The resulting protein intensities are identical since they were calculated with the same algorithm. ## Peptide-centric analysis/LiP-MS analysis In case you are interested in performing a **peptide-centric** analysis (necessary for LiP-MS), you should use the `evidence.txt` file provided in the search output of MaxQuant. The `evidence.txt` file basically contains all the information we need to run **protti**. It is also contained in a long format which makes it easy to read in and use directly. One thing to take into consideration is the lack of a column for information on proteotypicity of peptides. However, this information can be inferred from the `Proteins` column if it contains more than one protein ID. You can extract this information and create a new column called `is_proteotypic` containing logicals that will be `TRUE` if the `Proteins` column does not contain a semicolon and `FALSE` if it does (this indicates that the peptide belongs to more than one protein). As mentioned in the data analysis vignettes this information is necessary for the analysis of LiP-MS data but it could be also considered for the correct calculation of protein abundances. Another column that is required for the analysis of your data is a column indicating conditions to which certain samples belong. This can be easily added to the evidence file by joining a data frame containing the specific annotations. You can create such a data frame in Excel and import it into R for a large number of samples or just create it directly in R. MaxQuant output provides information on decoy hits contained in the column `reverse` and also has information on whether your hit is a contaminant `potential_contaminant`. You should filter these out before the analysis. However, the contaminant column can be used for quality control. One important thing for MaxQuant data is to **make sure that you only have one intensity assigned to each peptide or precursor.** You can do this by summing up all intensities that MaxQuant exports (these can be MULTI-MSMS, MSMS, ISO-MSMS, MULTI-MATCH, ISO-SECPEP) or you can filter for example for precursors with MULTI-MSMS quantification and only use these. In this section we will show you how to read in the file with `read_protti()` and how to create the `is_proteotypic` column and the `condition` column (minimally required) with the help of the `stringr` and `dplyr` packages. How to filter your data best is described in the data analysis vignettes. ```{r MaxQuant_peptide, eval=FALSE} # To read in your own data you can use read_protti() evidence <- read_protti(filename = "yourpath/evidence.txt") evidence_proteotypic <- evidence %>% # adds new column with logicals that are TRUE if the peptide can be assigned # to only one protein and FALSE if it can be assigned to multiple mutate(is_proteotypic = str_detect( string = proteins, pattern = ";", negate = TRUE )) %>% # adds new column with logicals indicating if peptide is coming from a potential contaminant mutate(is_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE)) # Make an annotation data frame and merge it with your data frame to obtain conditions # We are annotating sample 1-3 as controls and samples 4-6 as treated conditions file_name <- c( # make sure that the names are the same name as in your report "sample1", "sample2", "sample3", "sample4", "sample5", "sample6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation evidence_annotated <- evidence_proteotypic %>% left_join(y = annotation, by = "file_name") ``` ## Protein-centric analysis For **protein-centric** analyses you can use the `proteinGroups.txt` file provided by MaxQuant. This file contains information in a wide format where each sample has its own column containing intensity values. Therefore, we need to transform this data into a long format to meet the conditions of tidy data. We will filter the data and use `tidyr`'s `pivot_longer()` to change the format to long format. Furthermore, we produce an annotation data frame to create a `conditions` column. The filtering is only done in order to remove proteins with potentially low quality. Further filtering for decoys and potential contaminants should be performed based on the data analysis vignettes. ```{r MaxQuant_protein, eval=FALSE} # To read in your own data you can use read_protti() protein_groups <- read_protti(filename = "yourpath/proteinGroups.txt") %>% # adds new column with logicals indicating if protein is a potential contaminant, # you can filter these out later on. You should also consider filtering out proteins # that were "only identified by site" and reverse hits, as well as proteins with only # one identified peptide mutate(is_potential_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE)) # Change wide format to long format and create new columns called `r_file_name`and `intensity` protein_groups_long <- protein_groups %>% pivot_longer( cols = starts_with("intensity_"), names_to = "file_name", values_to = "intensity" ) # Make an annotation data frame and merge it with your data frame to obtain conditions # We are annotating sample 1-3 as controls and samples 4-6 as treated conditions file_name <- c( # make sure that the names are the same name as in your report "intensity_sample1", "intensity_sample2", "intensity_sample3", "intensity_sample4", "intensity_sample5", "intensity_sample6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation protein_groups_annotated <- protein_groups_long %>% left_join(y = annotation, by = "file_name") ``` # Skyline The Skyline output is already in long format, however, to process it you need to sum up the transition intensities to obtain the intensity of one precursor. If you prefer to analyse your data on the fragment level, you should create a column that uniquely identifies each fragment of each precursor. You could do that by pasting together the peptide sequence with the charge and the product m/z. The required Skyline output columns include: * Peptide Sequence * Protein Name * Replicate Name * Precursor Charge * Product Mz (we are exporting these to distinguish transitions) * Area (or Intensity, depending on what you prefer to use) You can add replicate and condition annotations in Skyline directly. However, we will explain in this section how you can also do it in R. If you want to analyse your data on the protein abundance level you will have to combine the precursor intensities to obtain one value for protein abundance. This could be done using the `calculate_protein_abundance()` function from **protti**. ```{r Skyline, eval=FALSE} # Load data skyline_data <- read_protti(filename = "yourpath/skyline.csv") skyline_data_int <- skyline_data %>% # create a column with precursor information mutate(precursor = paste0(peptide_sequence, "_", charge)) %>% group_by(replicate_name, precursor) %>% # making a new column containing the summed up intensities of all transitions of one precursor mutate(sum_intensity = sum(area)) %>% select(-c(product_mz, area)) %>% # removing the columns we don't need distinct() # removing duplicated rows from the data frame # Add annotation # make sure that the names are the same name as in your report replicate_name <- c( "sample_1", "sample_2", "sample_3", "sample_1", "sample_2", "sample_3" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(replicate_name, condition) # Combine your long data frame with the annotation skyline_annotated <- skyline_data_int %>% left_join(y = annotation, by = "replicate_name") ``` # Proteome Discoverer The Proteome Discoverer output contains data in wide format (one column for each sample). Similar to MaxQuant there is also the option for a peptide or a protein-centric export. We will discuss both cases in this segment. ## Peptide-centric analysis/LiP-MS analysis For a **peptide-centric** or a LiP-MS analysis please export the "Peptide Groups" report. Before preparing your export you can add the column "sequence" to your table otherwise Proteome Discoverer will only export the "annotated sequence" column which includes the preceding and following amino acids in the protein sequence. The required columns include: * Sequence * Modifications * Number Proteins * Contaminant * Master Protein Accessions * Abundance or normalized abundance columns * Quan Info After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such. We will read in the file using `read_protti()` and then select the columns we are interested in. You can use the `contaminant` column for qualitiy control. The `number_proteins` column contains information on the proteotypicity. If this is 1 then the peptide is proteotypic. If you want to analyse your data qualitatively only with quality control functions of **protti** you can keep peptides without quantifications. Before you start your quantitative analysis remove observations that are labeled `"No Quan Values"` in the `quan_info` column. In the below example they are filtered out at this step, but you can keep them and only filter them out later. ```{r Proteome_discoverer_pep, eval=FALSE} # Load data pd_pep_data <- read_protti("yourpath/PDpeptides.csv") # Select relevant columns pd_pep_selected <- pd_pep_data %>% select( sequence, modifications, number_proteins, contaminant, master_protein_accessions, starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped" quan_info ) # Filter data frame pd_pep_filtered <- pd_pep_selected %>% filter(contaminant == FALSE) %>% # remove annotated contaminants filter(number_proteins == 1) %>% # select proteotypic peptides filter(quan_info != "No Quan Values") # remove peptides that have no quantification values # Convert into long format pd_pep_long <- pd_pep_filtered %>% pivot_longer( cols = starts_with("abundances"), names_to = "file_name", values_to = "intensity" ) %>% # combine peptide sequence and modifications to make a precursor column mutate(precursor = paste(sequence, modifications)) # Make annotation data frame file_name <- c( # make sure that the names are the same name as in your report "abundances_grouped_f1", "abundances_grouped_f2", "abundances_grouped_f3", "abundances_grouped_f4", "abundances_grouped_f5", "abundances_grouped_f6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation pd_pep_long_annotated <- pd_pep_long %>% left_join(y = annotation, by = "file_name") ``` ## Protein-centric analysis For a **protein-centric** or analysis please export the "Proteins" report. The required columns include: * Accession * Description * Contaminant * Number Peptides * Abundance or normalized abundance columns After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such. We will read in the file using `read_protti()` and then select the columns we are interested in. Similar to above you can either filter the `contaminant` and `number_peptides` columns now or later. ```{r Proteome_discoverer_prot, eval=FALSE} # Load data pd_prot_data <- read_protti("yourpath/PDproteins.csv") # Select relevant columns pd_prot_selected <- pd_prot_data %>% select( accession, description, contaminant, number_peptides, starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped" ) # Filter data frame pd_prot_data_filtered <- pd_prot_selected %>% filter(contaminant == FALSE) %>% # remove annotated contaminants filter(number_peptides > 1) # select proteins with more than one identified peptide # Convert into long format pd_prot_long <- pd_prot_data_filtered %>% pivot_longer( cols = starts_with("abundances"), names_to = "file_name", values_to = "intensity" ) # Make annotation data frame file_name <- c( # make sure that the names are the same name as in your report "abundances_grouped_f1", "abundances_grouped_f2", "abundances_grouped_f3", "abundances_grouped_f4", "abundances_grouped_f5", "abundances_grouped_f6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation pd_prot_long_annotated <- pd_prot_long %>% left_join(y = annotation, by = "file_name") ``` # Other search engines and software As mentioned in the beginning of this vignette you can use the output of any search engine as long as it contains the minimally required columns. If it is not in the right format you can see if some of the above transformations can be applied to your data. It is also always useful to check if you can find additional columns that help you in your analysis and that you can export from your search engine. Always make sure that all of your observations are ones you are interested in. Check if there are decoys, contaminants or non-proteotypic peptides in your data. For protein-centric analysis, potentially remove quantifications that rely on only a few peptides.