Title: | Dereplicate and Cherry-Pick Mass Spectrometry Spectra |
---|---|
Description: | Convenient wrapper functions for the analysis of matrix-assisted laser desorption/ionization-time-of-flight (MALDI-TOF) spectra data in order to select only representative spectra (also called cherry-pick). The package covers the preprocessing and dereplication steps (based on Strejcek, Smrhova, Junkova and Uhlik (2018) <doi:10.3389/fmicb.2018.01294>) needed to cluster MALDI-TOF spectra before the final cherry-picking step. It enables the easy exclusion of spectra and/or clusters to accommodate complex cherry-picking strategies. Alternatively, cherry-picking using taxonomic identification MALDI-TOF data is made easy with functions to import inconsistently formatted reports. |
Authors: | Charlie Pauvert [aut, cre, cph] , David Wylensek [ctb] , Selina Nüchtern [ctb], Thomas Clavel [ctb, fnd, cph] |
Maintainer: | Charlie Pauvert <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.3.2 |
Built: | 2024-11-21 05:06:10 UTC |
Source: | https://github.com/clavellab/maldipickr |
Assess whether all the spectra in the list are not empty, of the same length and correspond to profile data.
check_spectra(spectra_list, tolerance = sqrt(.Machine$double.eps))
check_spectra(spectra_list, tolerance = sqrt(.Machine$double.eps))
spectra_list |
A list of MALDIquant::MassSpectrum objects |
tolerance |
A numeric indicating the accepted tolerance to the spectra length. The default value is the machine numerical precision and is close to 1.5e-8. |
A list of logical vectors of length spectra_list
indicating if the spectra are empty (is_empty
), of an odd length (is_outlier_length
) or not a profile spectra (is_not_regular
).
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Display the list of checks, with FALSE where no anomaly is detected check_spectra(spectra_list) # The overall sanity can be checked with Reduce Reduce(any, check_spectra(spectra_list)) # Should be FALSE
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Display the list of checks, with FALSE where no anomaly is detected check_spectra(spectra_list) # The overall sanity can be checked with Reduce Reduce(any, check_spectra(spectra_list)) # Should be FALSE
From the report of taxonomic identification produced by the Bruker MALDI Biotyper spectra sharing the same identification are labeled in the same cluster. Spectra with unknown identification (e.g., due to database completeness) are set in unique cluster.
delineate_with_identification(tibble_report)
delineate_with_identification(tibble_report)
tibble_report |
A tibble of n rows, with n the number of spectra,
produced by |
As all unknown identification are considered unique clusters within one input tibble, it is important to consider whether the taxonomic identifications come from a single report or multiple reports, depending on the research question. A message is displayed to confirm from which type of reports the delineation was done.
A tibble of n rows for each spectra and 3 columns:
name
: the spectra names from the name
column from the output of either read_biotyper_report()
or read_many_biotyper_reports()
.
membership
: integers stating the cluster number to which the spectra belong to. It starts from 1 to c, the total number of clusters.
cluster_size
: integers indicating the total number of spectra in the corresponding cluster.
report_unknown <- read_biotyper_report( system.file("biotyper_unknown.csv", package = "maldipickr") ) delineate_with_identification(report_unknown)
report_unknown <- read_biotyper_report( system.file("biotyper_unknown.csv", package = "maldipickr") ) delineate_with_identification(report_unknown)
From a matrix of spectra similarity (e.g., with the cosine metric, or Pearson product moment), infer the species clusters based on a threshold above (or equal to) which spectra are considered alike.
delineate_with_similarity(sim_matrix, threshold, method = "complete")
delineate_with_similarity(sim_matrix, threshold, method = "complete")
sim_matrix |
A |
threshold |
A numeric value indicating the minimal similarity between two spectra. Adjust accordingly to the similarity metric used. |
method |
The method of hierarchical clustering to use. The default and recommended method is "complete", but any methods from stats::hclust are valid. |
The similarity matrix is converted to a distance matrix by subtracting the value one. This approach works for cosine similarity and positive correlations that have an upper bound of 1. Clusters are then delineated using hierarchical clustering. The default method of hierarchical clustering is the complete linkage (also known as farthest neighbor clustering) to ensure that the within-group minimum similarity of each cluster respects the threshold. See the Details section of stats::hclust for others valid methods to use.
A tibble of rows for each spectra and 3 columns:
name
: the rownames of the similarity matrix indicating the spectra names
membership
: integers stating the cluster number to which the spectra belong to. It starts from 1 to , the total number of clusters.
cluster_size
: integers indicating the total number of spectra in the corresponding cluster.
For similarity metrics: coop::tcosine
, stats::cor
, Hmisc::rcorr
. For using taxonomic identifications for clusters : delineate_with_identification. For further analyses: set_reference_spectra.
# Toy similarity matrix between the six example spectra of # three species. The cosine metric is used and a value of # zero indicates dissimilar spectra and a value of one # indicates identical spectra. cosine_similarity <- matrix( c( 1, 0.79, 0.77, 0.99, 0.98, 0.98, 0.79, 1, 0.98, 0.79, 0.8, 0.8, 0.77, 0.98, 1, 0.77, 0.77, 0.77, 0.99, 0.79, 0.77, 1, 1, 0.99, 0.98, 0.8, 0.77, 1, 1, 1, 0.98, 0.8, 0.77, 0.99, 1, 1 ), nrow = 6, dimnames = list( c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ), c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ) ) ) # Delineate clusters based on a 0.92 threshold applied # to the similarity matrix delineate_with_similarity(cosine_similarity, threshold = 0.92)
# Toy similarity matrix between the six example spectra of # three species. The cosine metric is used and a value of # zero indicates dissimilar spectra and a value of one # indicates identical spectra. cosine_similarity <- matrix( c( 1, 0.79, 0.77, 0.99, 0.98, 0.98, 0.79, 1, 0.98, 0.79, 0.8, 0.8, 0.77, 0.98, 1, 0.77, 0.77, 0.77, 0.99, 0.79, 0.77, 1, 1, 0.99, 0.98, 0.8, 0.77, 1, 1, 1, 0.98, 0.8, 0.77, 0.99, 1, 1 ), nrow = 6, dimnames = list( c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ), c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ) ) ) # Delineate clusters based on a 0.92 threshold applied # to the similarity matrix delineate_with_similarity(cosine_similarity, threshold = 0.92)
Aggregate spectra quality-check statistics
gather_spectra_stats(check_vectors)
gather_spectra_stats(check_vectors)
check_vectors |
A list of logical vectors from check_spectra |
A tibble of one row with the following 5 columns of integers:
n_spectra
: total number of raw spectra.
n_valid_spectra
: total number of spectra passing all quality checks
is_empty
, is_outlier_length
and is_not_regular
: total of spectra flagged with these irregularities.
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Display the list of checks, with FALSE where no anomaly is detected checks <- check_spectra(spectra_list) # Aggregate the statistics of quality-checked spectra gather_spectra_stats(checks)
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Display the list of checks, with FALSE where no anomaly is detected checks <- check_spectra(spectra_list) # Aggregate the statistics of quality-checked spectra gather_spectra_stats(checks)
Given the list of raw spectra, get_spectra_names()
extracts the spectra names
using the file metadata, and warns if the associated sanitized names are not unique.
get_spectra_names(spectra_list)
get_spectra_names(spectra_list)
spectra_list |
A list of MALDIquant::MassSpectrum objects. |
A tibble with four columns
sanitized_name
: spectra names based on fullName
where dots and dashes are converted to underscores
name
: spectra name using the name
label in the spectra metadata
fullName
: spectra full name using the fullName
label in the spectra metadata
file
: the path to the raw spectra data
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Extract the names get_spectra_names(spectra_list) # Artificially create duplicated entries to show the warning get_spectra_names(spectra_list[c(1,1)])
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Extract the names get_spectra_names(spectra_list) # Artificially create duplicated entries to show the warning get_spectra_names(spectra_list[c(1,1)])
This function is a wrapper around the readBrukerFlexData::readBrukerFlexDir()
to read both acqus
and acqu
MALDI files.
import_biotyper_spectra( biotyper_directory, remove_calibration = c("BTS", "Autocalibration") )
import_biotyper_spectra( biotyper_directory, remove_calibration = c("BTS", "Autocalibration") )
biotyper_directory |
A path to the folder tree with the spectra to be imported. |
remove_calibration |
A vector of characters used as regex to indicate which (calibration) spectra are going to be removed. |
When using readBrukerFlexData::readBrukerFlexDir()
on acqus
files (instead of the native acqu
files), the function will fail with the following error message:
Error in .readAcquFile(fidFile = fidFile, verbose = verbose) : File ‘/data/maldi_dir/targetA/0_D10/1/1SLin/acqu’ doesn't exists!
But it turns out that acqu
and acqus
files are the same, so the function here create acqu
symbolic links that point to acqus
files.
A list of MALDIquant::MassSpectrum objects
check_spectra, process_spectra
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Display the list of spectra spectra_list
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Display the list of spectra spectra_list
Reformat the table output from the analysis of raw Bruker MALDI Biotyper spectra by the SPeDE tool from Dumolin et al. (2019) to be consistent with the Strejcek et al. (2018) procedure followed in the maldipickr package.
import_spede_clusters(path)
import_spede_clusters(path)
path |
Path to the comma separated table generated by SPeDE |
A tibble with the following columns:
name
: a character denoting the spectra name (all spaces, dashes and dots are replaced by underscores "_" in SPeDE)
membership
: integers stating the cluster number to which the spectra belong to. It starts from 1 to c, the total number of clusters.
cluster_size
: integers indicating the total number of spectra in the corresponding cluster.
quality
: a character indicating the spectra quality category by SPeDE, out of GREEN, ORANGE and RED.
is_reference
: a logical indicating whether the corresponding spectra is a reference spectra of the cluster.
Dumolin C, Aerts M, Verheyde B, Schellaert S, Vandamme T, Van Der Jeugt F, De Canck E, Cnockaert M, Wieme AD, Cleenwerck I, Peiren J, Dawyndt P, Vandamme P, & Carlier A. (2019). "Introducing SPeDE: High-Throughput Dereplication and Accurate Determination of Microbial Diversity from Matrix-Assisted Laser Desorption–Ionization Time of Flight Mass Spectrometry Data". MSystems 4(5). doi:10.1128/msystems.00437-19.
https://github.com/LM-UGent/SPeDE
# Reformat the output from SPeDE table # https://github.com/LM-UGent/SPeDE import_spede_clusters( system.file("spede.csv", package = "maldipickr") )
# Reformat the output from SPeDE table # https://github.com/LM-UGent/SPeDE import_spede_clusters( system.file("spede.csv", package = "maldipickr") )
Identify the wells on the plate's edge
is_well_on_edge( well_number, plate_layout = c(96, 384), edges = c("top", "bottom", "left", "right"), details = FALSE )
is_well_on_edge( well_number, plate_layout = c(96, 384), edges = c("top", "bottom", "left", "right"), details = FALSE )
well_number |
A vector of positive numeric well identifier |
plate_layout |
An integer indicating the maximum number of well on the plate |
edges |
A character vector pointing which plate edges should be considered |
details |
A logical controlling whether a data.frame with more details should be returned |
Flag the wells located on the edges of a 96- or 384-well plate, based on the following well numbering:
Well numbers start at 1
Well are numbered from left to right and then top to bottom of the plate.
A logical vector, the same length as well_number
indicating whether the well is on the edge. If details = TRUE
, the function returns a data.frame that complements the logical vector with the well_number
, row and column positions.
# Logical vector indicating whether the wells are on the four edges is_well_on_edge(1:96, plate_layout = 96) # More details can be obtained to verify the results well_df <- is_well_on_edge(1:96, plate_layout = 96, details = TRUE) # And the resulting prediction displayed matrix(well_df$is_edge, ncol = max(well_df$col), byrow = TRUE)
# Logical vector indicating whether the wells are on the four edges is_well_on_edge(1:96, plate_layout = 96) # More details can be obtained to verify the results well_df <- is_well_on_edge(1:96, plate_layout = 96, details = TRUE) # And the resulting prediction displayed matrix(well_df$is_edge, ncol = max(well_df$col), byrow = TRUE)
Aggregate multiple processed spectra, their associated peaks and metadata into a feature matrix and a concatenated metadata table.
merge_processed_spectra( processed_spectra, remove_peakless_spectra = TRUE, interpolate_missing = TRUE )
merge_processed_spectra( processed_spectra, remove_peakless_spectra = TRUE, interpolate_missing = TRUE )
processed_spectra |
A list of the processed spectra and associated peaks and metadata in two possible formats:
|
remove_peakless_spectra |
A logical indicating whether to discard the spectra without detected peaks. |
interpolate_missing |
A logical indicating if intensity values for missing peaks should be interpolated from the processed spectra signal or left NA which would then be converted to 0. |
A n×p matrix, with n spectra as rows and p features as columns that are the peaks found in all the processed spectra.
When aggregating multiple runs of processed spectra, if a named list is provided, note that the names will be dropped, to prevent further downstream issues when these names were being appended to the rownames of the matrix thus preventing downstream metadata merge.
process_spectra, the "Value" section in MALDIquant::intensityMatrix
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Transform the spectra signals according to Strejcek et al. (2018) processed <- process_spectra(spectra_list) # Merge the spectra to produce the feature matrix fm <- merge_processed_spectra(list(processed)) # The feature matrix has 6 spectra as rows and # 35 peaks as columns dim(fm) # Notice the difference when the interpolation is turned off fm_no_interpolation <- merge_processed_spectra( list(processed), interpolate_missing = FALSE ) sum(fm == 0) # 0 sum(fm_no_interpolation == 0) # 68 # Multiple runs can be aggregated using list() # Merge the spectra to produce the feature matrix fm_all <- merge_processed_spectra(list(processed, processed, processed)) # The feature matrix has 3×6=18 spectra as rows and # 35 peaks as columns dim(fm_all) # If using a list, names will be dropped and are not propagated to the matrix. ## Not run: fm_all <- merge_processed_spectra( list("A" = processed, "B" = processed, "C" = processed)) any(grepl("A|B|C", rownames(fm_all))) # FALSE ## End(Not run)
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Transform the spectra signals according to Strejcek et al. (2018) processed <- process_spectra(spectra_list) # Merge the spectra to produce the feature matrix fm <- merge_processed_spectra(list(processed)) # The feature matrix has 6 spectra as rows and # 35 peaks as columns dim(fm) # Notice the difference when the interpolation is turned off fm_no_interpolation <- merge_processed_spectra( list(processed), interpolate_missing = FALSE ) sum(fm == 0) # 0 sum(fm_no_interpolation == 0) # 68 # Multiple runs can be aggregated using list() # Merge the spectra to produce the feature matrix fm_all <- merge_processed_spectra(list(processed, processed, processed)) # The feature matrix has 3×6=18 spectra as rows and # 35 peaks as columns dim(fm_all) # If using a list, names will be dropped and are not propagated to the matrix. ## Not run: fm_all <- merge_processed_spectra( list("A" = processed, "B" = processed, "C" = processed)) any(grepl("A|B|C", rownames(fm_all))) # FALSE ## End(Not run)
Using the clusters information, and potential additional metadata as external criteria, spectra are labeled as to be picked for each cluster. Note that some spectra and therefore clusters can be explicitly removed (masked) from the picking decision if they have been previously picked or should be discarded, using logical columns in the metadata table. If no metadata are provided, the reference spectra of each cluster will be picked.
pick_spectra( cluster_df, metadata_df = NULL, criteria_column = NULL, hard_mask_column = NULL, soft_mask_column = NULL, is_descending_order = TRUE, is_sorted = FALSE )
pick_spectra( cluster_df, metadata_df = NULL, criteria_column = NULL, hard_mask_column = NULL, soft_mask_column = NULL, is_descending_order = TRUE, is_sorted = FALSE )
cluster_df |
A tibble with clusters information from the delineate_with_similarity or the import_spede_clusters function. |
metadata_df |
Optional tibble with relevant metadata to guide the picking process (e.g., OD600). |
criteria_column |
Optional character indicating the column in |
hard_mask_column |
Column name in the |
soft_mask_column |
Column name in the |
is_descending_order |
Optional logical indicating whether to sort the |
is_sorted |
Optional logical to indicate that the |
A tibble with as many rows as cluster_df
with an additional logical
column named to_pick
to indicate whether the colony associated to the spectra
should be picked. If metadata_df
is provided, then additional columns from
this tibble are added to the returned tibble.
delineate_with_similarity, set_reference_spectra. For a useful utility function to soft-mask specific spectra: is_well_on_edge.
# 0. Load a toy example of a tibble of clusters created by # the `delineate_with_similarity` function. clusters <- readRDS( system.file("clusters_tibble.RDS", package = "maldipickr" ) ) # 1. By default and if no other metadata are provided, # the function picks reference spectra for each clusters. # # N.B: The spectra `name` and `to_pick` columns are moved to the left # only for clarity using the `relocate()` function. # pick_spectra(clusters) %>% dplyr::relocate(name, to_pick) # only for clarity # 2.1 Simulate OD600 values with uniform distribution # for each of the colonies we measured with # the Bruker MALDI Biotyper set.seed(104) metadata <- dplyr::transmute( clusters, name = name, OD600 = runif(n = nrow(clusters)) ) metadata # 2.2 Pick the spectra based on the highest # OD600 value per cluster pick_spectra(clusters, metadata, "OD600") %>% dplyr::relocate(name, to_pick) # only for clarity # 3.1 Say that the wells on the right side of the plate are # used for negative controls and should not be picked. metadata <- metadata %>% dplyr::mutate( well = gsub(".*[A-Z]([0-9]{1,2}$)", "\\1", name) %>% strtoi(), is_edge = is_well_on_edge( well_number = well, plate_layout = 96, edges = "right" ) ) # 3.2 Pick the spectra after discarding (or soft masking) # the spectra indicated by the `is_edge` column. pick_spectra(clusters, metadata, "OD600", soft_mask_column = "is_edge" ) %>% dplyr::relocate(name, to_pick) # only for clarity # 4.1 Say that some spectra were picked before # (e.g., in the column F) in a previous experiment. # We do not want to pick clusters with those spectra # included to limit redundancy. metadata <- metadata %>% dplyr::mutate( picked_before = grepl("_F", name) ) # 4.2 Pick the spectra from clusters without spectra # labeled as `picked_before` (hard masking). pick_spectra(clusters, metadata, "OD600", hard_mask_column = "picked_before" ) %>% dplyr::relocate(name, to_pick) # only for clarity
# 0. Load a toy example of a tibble of clusters created by # the `delineate_with_similarity` function. clusters <- readRDS( system.file("clusters_tibble.RDS", package = "maldipickr" ) ) # 1. By default and if no other metadata are provided, # the function picks reference spectra for each clusters. # # N.B: The spectra `name` and `to_pick` columns are moved to the left # only for clarity using the `relocate()` function. # pick_spectra(clusters) %>% dplyr::relocate(name, to_pick) # only for clarity # 2.1 Simulate OD600 values with uniform distribution # for each of the colonies we measured with # the Bruker MALDI Biotyper set.seed(104) metadata <- dplyr::transmute( clusters, name = name, OD600 = runif(n = nrow(clusters)) ) metadata # 2.2 Pick the spectra based on the highest # OD600 value per cluster pick_spectra(clusters, metadata, "OD600") %>% dplyr::relocate(name, to_pick) # only for clarity # 3.1 Say that the wells on the right side of the plate are # used for negative controls and should not be picked. metadata <- metadata %>% dplyr::mutate( well = gsub(".*[A-Z]([0-9]{1,2}$)", "\\1", name) %>% strtoi(), is_edge = is_well_on_edge( well_number = well, plate_layout = 96, edges = "right" ) ) # 3.2 Pick the spectra after discarding (or soft masking) # the spectra indicated by the `is_edge` column. pick_spectra(clusters, metadata, "OD600", soft_mask_column = "is_edge" ) %>% dplyr::relocate(name, to_pick) # only for clarity # 4.1 Say that some spectra were picked before # (e.g., in the column F) in a previous experiment. # We do not want to pick clusters with those spectra # included to limit redundancy. metadata <- metadata %>% dplyr::mutate( picked_before = grepl("_F", name) ) # 4.2 Pick the spectra from clusters without spectra # labeled as `picked_before` (hard masking). pick_spectra(clusters, metadata, "OD600", hard_mask_column = "picked_before" ) %>% dplyr::relocate(name, to_pick) # only for clarity
Process Bruker MALDI Biotyper spectra à la Strejcek et al. (2018)
process_spectra( spectra_list, spectra_names = get_spectra_names(spectra_list), rds_prefix = deprecated() )
process_spectra( spectra_list, spectra_names = get_spectra_names(spectra_list), rds_prefix = deprecated() )
spectra_list |
A list of MALDIquant::MassSpectrum objects. |
spectra_names |
A tibble::tibble (or data.frame) of sanitized spectra names by default from get_spectra_names. If provided manually, the column |
rds_prefix |
Writing to disk as RDS is no longer supported. A character indicating the prefix for the |
Based on the original implementation, the function performs the following tasks:
Square-root transformation
Mass range trimming to 4-10 kDa as they were deemed most determinant by Strejcek et al. (2018)
Signal smoothing using the Savitzky-Golay method and a half window size of 20
Baseline correction with the SNIP procedure
Normalization by Total Ion Current
Peak detection using the SuperSmoother procedure and with a signal-to-noise ratio above 3
Peak filtering. This step has been added to discard peaks with a negative signal-to-noise ratio probably due to being on the edge of the mass range.
A named list of three objects:
spectra
: a list the length of the spectra list of MALDIquant::MassSpectrum objects.
peaks
: a list the length of the spectra list of MALDIquant::MassPeaks objects.
metadata
: a tibble indicating the median signal-to-noise ratio (SNR
) and peaks number for all spectra list (peaks
), with spectra names in the name
column.
The original R code on which this function is based is accessible at: https://github.com/strejcem/MALDIvs16S
Strejcek M, Smrhova T, Junkova P & Uhlik O (2018). “Whole-Cell MALDI-TOF MS versus 16S rRNA Gene Analysis for Identification and Dereplication of Recurrent Bacterial Isolates.” Frontiers in Microbiology 9 doi:10.3389/fmicb.2018.01294.
import_biotyper_spectra and check_spectra for the inputs and merge_processed_spectra for further analysis.
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Transform the spectra signals according to Strejcek et al. (2018) processed <- process_spectra(spectra_list) # Overview of the list architecture that is returned # with the list of processed spectra, peaks identified and the # metadata table str(processed, max.level = 2) # A detailed view of the metadata with the median signal-to-noise # ratio (SNR) and the number of peaks processed$metadata
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import the six spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra) # Transform the spectra signals according to Strejcek et al. (2018) processed <- process_spectra(spectra_list) # Overview of the list architecture that is returned # with the list of processed spectra, peaks identified and the # metadata table str(processed, max.level = 2) # A detailed view of the metadata with the median signal-to-noise # ratio (SNR) and the number of peaks processed$metadata
The header-less table exported by the Compass software in the Bruker MALDI Biotyper device is separated by semi-colons and has empty columns which prevent an easy import in R. This function reads the report correctly as a tibble.
read_biotyper_report(path, best_hits = TRUE, long_format = TRUE)
read_biotyper_report(path, best_hits = TRUE, long_format = TRUE)
path |
Path to the semi-colon separated table |
best_hits |
A logical indicating whether to return only the best hits for each target analyzed |
long_format |
A logical indicating whether the table is in the long format (many rows) or wide format (many columns) when showing all the hits. This option has no effect when |
The header-less table contains identification information for each target processed by
the Biotyper device and once processed by the read_biotyper_report
,
the following seven columns are available in the tibble, when using the best_hits = TRUE
option:
name
: a character indicating the name of the spot of the MALDI target (i.e., plate)
sample_name
: the character string provided during the preparation of the MALDI target (i.e., plate)
hit_rank
: an integer indicating the rank of the hit for the corresponding target and identification
bruker_quality
: a character encoding the quality of the identification with potentially multiple "+" symbol or only one "-"
bruker_species
: the species name associated with the MALDI spectrum analyzed.
bruker_taxid
: the NCBI Taxonomy Identifier of the species name in the column species
bruker_hash
: a hash from an undocumented checksum function probably to encode the database entry.
bruker_log
: the log-score of the identification.
When all hits are returned (with best_hits = FALSE
), the default output format is the long format (long_format = TRUE
), meaning that the previous columns remain
unchanged, but all hits are now returned, thus increasing the number of rows.
When all hits are returned (with best_hits = FALSE
) using the wide format (long_format = FALSE), the two columns
nameand
sample_nameremains unchanged, but the five columns prefixed by
bruker_' contain the hit rank, creating a tibble of 52 columns:
bruker_01_quality
bruker_01_species
bruker_01_taxid
bruker_01_hash
bruker_01_log
bruker_02_quality
...
bruker_10_species
bruker_10_taxid
bruker_10_hash
bruker_10_log
A tibble of 7 columns (best_hits = TRUE
) or 52 columns (best_hits = FALSE
). See Details for the description of the columns.
A report that contains only spectra with no peaks found will return a tibble of 0 rows and a warning message.
# Get a example Bruker report biotyper <- system.file("biotyper.csv", package = "maldipickr") # Import the report as a tibble report_tibble <- read_biotyper_report(biotyper) # Display the tibble report_tibble
# Get a example Bruker report biotyper <- system.file("biotyper.csv", package = "maldipickr") # Import the report as a tibble report_tibble <- read_biotyper_report(biotyper) # Display the tibble report_tibble
Importing a list of Bruker MALDI Biotyper CSV reports
read_many_biotyper_reports(path_to_reports, report_ids, best_hits = TRUE, ...)
read_many_biotyper_reports(path_to_reports, report_ids, best_hits = TRUE, ...)
path_to_reports |
A vector of paths to the csv files to be imported by |
report_ids |
A vector of character names for each of the reports. |
best_hits |
A logical indicating whether to return only the best hit in the |
... |
Name-value pairs to be passed on to |
A tibble just like the one returned by the read_biotyper_report()
function, except that the name of the spot of the MALDI target (i.e., plate) is registered to the original_name
column (instead of the name
column), and the column name
consist in the provided report_ids
used as a prefix of the original_name
column.
The report identifiers are sanitized to convert all dashes (-
) as underscores (_
).
# List of Bruker MALDI Biotyper reports reports_paths <- system.file( c("biotyper.csv", "biotyper.csv", "biotyper.csv"), package = "maldipickr" ) # Read the list of reports and combine them in a single tibble read_many_biotyper_reports( reports_paths, report_ids = c("first", "second", "third"), # Additional metadata below are passed to dplyr::mutate growth_temperature = 37.0 )
# List of Bruker MALDI Biotyper reports reports_paths <- system.file( c("biotyper.csv", "biotyper.csv", "biotyper.csv"), package = "maldipickr" ) # Read the list of reports and combine them in a single tibble read_many_biotyper_reports( reports_paths, report_ids = c("first", "second", "third"), # Additional metadata below are passed to dplyr::mutate growth_temperature = 37.0 )
The remove_spectra()
function is used to discard specific spectra from (1) raw spectra list by removing them, or (2) processed spectra by removing them from the spectra, peaks and metadata objects.
remove_spectra(spectra_list, to_remove)
remove_spectra(spectra_list, to_remove)
spectra_list |
A list of MALDIquant::MassSpectrum objects OR A list of processed spectra from process_spectra |
to_remove |
The spectra to be removed. In the case of raw spectra: a logical vector same size of |
The same object as spectra_list
minus the spectra in to_remove
.
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import only the first two spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)[1:2] # Introduce artificially an empty raw spectra spectra_list <- c(spectra_list, MALDIquant::createMassSpectrum(0, 0)) # Empty spectra are detected by `check_spectra()` # and can be removed by `remove_spectra()` spectra_list %>% remove_spectra(to_remove = check_spectra(.)) # Get an example processed spectra processed_path <- system.file( "three_processed_spectra_with_one_peakless.RDS", package = "maldipickr") processed <- readRDS(processed_path) %>% list() # Remove a specific spectra remove_spectra(processed, "empty_H12")
# Get an example directory of six Bruker MALDI Biotyper spectra directory_biotyper_spectra <- system.file( "toy-species-spectra", package = "maldipickr" ) # Import only the first two spectra spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)[1:2] # Introduce artificially an empty raw spectra spectra_list <- c(spectra_list, MALDIquant::createMassSpectrum(0, 0)) # Empty spectra are detected by `check_spectra()` # and can be removed by `remove_spectra()` spectra_list %>% remove_spectra(to_remove = check_spectra(.)) # Get an example processed spectra processed_path <- system.file( "three_processed_spectra_with_one_peakless.RDS", package = "maldipickr") processed <- readRDS(processed_path) %>% list() # Remove a specific spectra remove_spectra(processed, "empty_H12")
Define a high-quality spectra as a representative spectra of the cluster based on the highest median signal-to-noise ratio and the number of detected peaks
set_reference_spectra(cluster_df, metadata_df)
set_reference_spectra(cluster_df, metadata_df)
cluster_df |
A tibble of n rows for each spectra produced by delineate_with_similarity function with at least the following columns:
|
metadata_df |
A tibble of n rows for each spectra produced by the process_spectra function with median signal-to-noise ratio ( |
A merged tibble in the same order as cluster_df
with both the columns of cluster_df
and metadata_df
, as well as a logical column is_reference
indicating if the spectrum is the reference spectra of the cluster.
delineate_with_similarity, pick_spectra
# Get an example directory of six Bruker MALDI Biotyper spectra # Import the six spectra and # Transform the spectra signals according to Strejcek et al. (2018) processed <- system.file( "toy-species-spectra", package = "maldipickr" ) %>% import_biotyper_spectra() %>% process_spectra() # Toy similarity matrix between the six example spectra of # three species. The cosine metric is used and a value of # zero indicates dissimilar spectra and a value of one # indicates identical spectra. cosine_similarity <- matrix( c( 1, 0.79, 0.77, 0.99, 0.98, 0.98, 0.79, 1, 0.98, 0.79, 0.8, 0.8, 0.77, 0.98, 1, 0.77, 0.77, 0.77, 0.99, 0.79, 0.77, 1, 1, 0.99, 0.98, 0.8, 0.77, 1, 1, 1, 0.98, 0.8, 0.77, 0.99, 1, 1 ), nrow = 6, dimnames = list( c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ), c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ) ) ) # Delineate clusters based on a 0.92 threshold applied # to the similarity matrix clusters <- delineate_with_similarity( cosine_similarity, threshold = 0.92 ) # Set reference spectra with the toy example set_reference_spectra(clusters, processed$metadata)
# Get an example directory of six Bruker MALDI Biotyper spectra # Import the six spectra and # Transform the spectra signals according to Strejcek et al. (2018) processed <- system.file( "toy-species-spectra", package = "maldipickr" ) %>% import_biotyper_spectra() %>% process_spectra() # Toy similarity matrix between the six example spectra of # three species. The cosine metric is used and a value of # zero indicates dissimilar spectra and a value of one # indicates identical spectra. cosine_similarity <- matrix( c( 1, 0.79, 0.77, 0.99, 0.98, 0.98, 0.79, 1, 0.98, 0.79, 0.8, 0.8, 0.77, 0.98, 1, 0.77, 0.77, 0.77, 0.99, 0.79, 0.77, 1, 1, 0.99, 0.98, 0.8, 0.77, 1, 1, 1, 0.98, 0.8, 0.77, 0.99, 1, 1 ), nrow = 6, dimnames = list( c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ), c( "species1_G2", "species2_E11", "species2_E12", "species3_F7", "species3_F8", "species3_F9" ) ) ) # Delineate clusters based on a 0.92 threshold applied # to the similarity matrix clusters <- delineate_with_similarity( cosine_similarity, threshold = 0.92 ) # Set reference spectra with the toy example set_reference_spectra(clusters, processed$metadata)