This function imports user-defined sample metadata saved in a spreadsheet.

importSampleData(file, sheet = 1L, lanes = 0L, pipeline = c("none",
  "bcbio", "cellranger", "cpi"))

Arguments

file

character(1). File path.

sheet

character(1) or integer(1). Applies to Excel Workbook, Google Sheet, or GraphPad Prism file. Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet.

lanes

integer(1). Number of lanes used to split the samples into technical replicates. Each lane-split replicate is identified by a lane suffix (i.e. _LXXX).

pipeline

character(1). Analysis pipeline:

  • "none": Simple mode, requiring only "sampleID" column.

  • "bcbio": bcbio mode. See the "bcbio pipeline" section below for details.

  • "cellranger": Cell Ranger mode. Currently requires "directory" column. Used by Chromium R package.

  • "cpi": Constellation Pharmaceuticals (CPI) mode.
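
As a rough illustration of the lanes argument, the sketch below expands sample names into lane-split technical replicates. It assumes that _LXXX denotes a zero-padded, three-digit lane number (e.g. _L001). The helper addLaneSuffix() is hypothetical and is not part of this package.

```r
## Hypothetical helper (not part of the package): append lane suffixes.
## Assumes "_LXXX" means a zero-padded, three-digit lane number.
addLaneSuffix <- function(samples, lanes = 0L) {
    stopifnot(is.character(samples), length(lanes) == 1L, lanes >= 0L)
    ## With no lanes defined, the sample names are returned unchanged.
    if (lanes == 0L) {
        return(samples)
    }
    suffix <- sprintf("_L%03d", seq_len(lanes))
    ## One entry per sample/lane combination, grouped by sample.
    paste0(
        rep(samples, each = lanes),
        rep(suffix, times = length(samples))
    )
}

addLaneSuffix(c("sample1", "sample2"), lanes = 2L)
```

With lanes = 0L (the default), the input is passed through without any suffix.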

Value

DataFrame.

Note

Works with local or remote files.

Updated 2019-10-10.

bcbio pipeline

Required column names. The "description" column is always required, and must match the bcbio per-sample directory names exactly. The "fileName" column is optional but recommended for data provenance. Note that some bcbio examples on readthedocs use "samplename" (note case) instead of "fileName". This function checks for that and renames the column to "fileName" automatically. The "sampleName" column (note case) is used to define unique sample names, in the event that bcbio has processed multiplexed samples.
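
The column handling described above can be sketched in base R. This is only an illustration under assumptions, not the package's implementation: the helper standardizeColumns() is hypothetical, and it operates on a plain data.frame rather than a DataFrame.

```r
## Hypothetical helper (not the package's implementation).
standardizeColumns <- function(df) {
    stopifnot(is.data.frame(df))
    ## Rename "samplename" (all lower case) to "fileName" when present.
    names(df)[names(df) == "samplename"] <- "fileName"
    ## The "description" column is always required in bcbio mode.
    if (!"description" %in% names(df)) {
        stop("Required column missing: description")
    }
    df
}

df <- data.frame(
    samplename = c("sample1_R1.fastq.gz", "sample2_R1.fastq.gz"),
    description = c("sample1", "sample2")
)
colnames(standardizeColumns(df))
```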

Demultiplexed samples. The samples in the bcbio run must map to the "description" column. The values provided in "description" for demultiplexed samples must be unique. They must also be syntactically valid, meaning that they cannot contain illegal characters (e.g. spaces, dashes) or begin with a number. Consult the documentation in help(topic = "make.names") for more information on valid names in R.
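
One way to check these requirements before importing is to compare each value against the output of make.names(), which leaves syntactically valid names unchanged. The values below are made up for illustration.

```r
## Candidate values for the "description" column (illustrative only).
description <- c("sample1", "sample-2", "1sample", "sample 4", "sample1")

## A name is syntactically valid when make.names() leaves it unchanged.
isValid <- make.names(description) == description
## Demultiplexed descriptions must also be unique.
isDuplicate <- duplicated(description)

data.frame(description, isValid, isDuplicate)
```

Here only "sample1" passes the validity check, and its second occurrence is flagged as a duplicate.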

Multiplexed samples. This applies to some single-cell RNA-seq formats, including inDrops. In this case, bcbio will output per-sample directories with this structure: description-revcomp. importSampleData() checks to see if the "description" column is unique. If the values are duplicated, the function assumes that bcbio processed multiplexed FASTQs, where multiple samples of interest are barcoded inside a single FASTQ. In this case, you must supply additional "index", "sequence", and "sampleName" columns. Note that bcbio currently outputs the reverse complement index sequence in the sample directory names (e.g. "sample-ATAGAGAG"). Define the forward index barcode in the "sequence" column here, not the reverse complement. The reverse complement will be calculated automatically and added as the "revcomp" column in the sample metadata.
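
The relationship between the forward barcode and the directory name can be sketched in base R. The helper revcomp() below is hypothetical, not the package's implementation, and assumes barcodes contain only upper-case A, C, G, and T. The barcodes are taken from the multiplexed example on this page.

```r
## Hypothetical helper (not the package's implementation).
## Assumes barcodes contain only upper-case A, C, G, and T.
revcomp <- function(x) {
    stopifnot(is.character(x), all(grepl("^[ACGT]+$", x)))
    ## Complement each base, then reverse the order of the bases.
    complement <- chartr("ACGT", "TGCA", x)
    vapply(
        strsplit(complement, split = ""),
        function(bases) paste(rev(bases), collapse = ""),
        character(1L),
        USE.NAMES = FALSE
    )
}

description <- c("indrops1", "indrops1", "indrops2", "indrops2")
## Duplicated descriptions indicate multiplexed FASTQs.
multiplexed <- any(duplicated(description))

## Forward index barcodes, as supplied in the "sequence" column.
forward <- c("CTCTCTAT", "TATCCTCT", "CTCTCTAT", "TATCCTCT")
## Directory names follow the description-revcomp structure.
paste(description, revcomp(forward), sep = "-")
```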

Examples

## Demultiplexed ====
file <- file.path(basejumpTestsURL, "bcbio-metadata-demultiplexed.csv")
x <- importSampleData(file, pipeline = "bcbio")
#> Importing 'bcbio-metadata-demultiplexed.csv' using 'data.table::fread()'.
print(x)
#> DataFrame with 4 rows and 4 columns
#>         sampleName            fileName description genotype
#>           <factor>            <factor>    <factor> <factor>
#> sample1    sample1 sample1_R1.fastq.gz     sample1 wildtype
#> sample2    sample2 sample2_R1.fastq.gz     sample2 knockout
#> sample3    sample3 sample3_R1.fastq.gz     sample3 wildtype
#> sample4    sample4 sample4_R1.fastq.gz     sample4 knockout

## Multiplexed ====
file <- file.path(basejumpTestsURL, "bcbio-metadata-multiplexed-indrops.csv")
x <- importSampleData(file, pipeline = "bcbio")
#> Importing 'bcbio-metadata-multiplexed-indrops.csv' using 'data.table::fread()'.
#> Multiplexed samples detected.
print(x)
#> DataFrame with 8 rows and 8 columns
#>                   sampleName             fileName       description    index
#>                     <factor>             <factor>          <factor> <factor>
#> indrops1_AGAGGATA  sample2_1 indrops1_R1.fastq.gz indrops1-AGAGGATA        2
#> indrops1_ATAGAGAG  sample1_1 indrops1_R1.fastq.gz indrops1-ATAGAGAG        1
#> indrops1_CTCCTTAC  sample3_1 indrops1_R1.fastq.gz indrops1-CTCCTTAC        3
#> indrops1_TATGCAGT  sample4_1 indrops1_R1.fastq.gz indrops1-TATGCAGT        4
#> indrops2_AGAGGATA  sample2_2 indrops2_R1.fastq.gz indrops2-AGAGGATA        2
#> indrops2_ATAGAGAG  sample1_2 indrops2_R1.fastq.gz indrops2-ATAGAGAG        1
#> indrops2_CTCCTTAC  sample3_2 indrops2_R1.fastq.gz indrops2-CTCCTTAC        3
#> indrops2_TATGCAGT  sample4_2 indrops2_R1.fastq.gz indrops2-TATGCAGT        4
#>                   sequence aggregate genotype  revcomp
#>                   <factor>  <factor> <factor> <factor>
#> indrops1_AGAGGATA TATCCTCT   sample2 knockout AGAGGATA
#> indrops1_ATAGAGAG CTCTCTAT   sample1 wildtype ATAGAGAG
#> indrops1_CTCCTTAC GTAAGGAG   sample3 wildtype CTCCTTAC
#> indrops1_TATGCAGT ACTGCATA   sample4 knockout TATGCAGT
#> indrops2_AGAGGATA TATCCTCT   sample2 knockout AGAGGATA
#> indrops2_ATAGAGAG CTCTCTAT   sample1 wildtype ATAGAGAG
#> indrops2_CTCCTTAC GTAAGGAG   sample3 wildtype CTCCTTAC
#> indrops2_TATGCAGT ACTGCATA   sample4 knockout TATGCAGT