Abstract

The basejump package is an infrastructure toolkit that extends the base functionality of Bioconductor (Huber et al. 2015). The package leverages the S4 object system for object-oriented programming, and defines multiple additional generic functions for use in genomics research. basejump provides simple, user-friendly functions for the acquisition of genome annotations from multiple online databases, including native support for Ensembl (T. Hubbard et al. 2002) and websites supporting standard FASTA and GTF/GFF file formats. Consistent handling of sample metadata remains a challenge for many bioinformatic analyses, and basejump aims to address this by providing a suite of sanitization functions to help standardize these variable inputs. Additionally, interactive read/write operations in R can be cumbersome and non-trivial when working with multiple data objects; here we provide additional functions designed for interactive use that aim to reduce friction and provide consistent handling of multiple common file formats used in genomics research.

Introduction

library(basejump)
library(SummarizedExperiment)
options(acid.test = TRUE)
data(rse, package = "acidtest")

This vignette focuses on the most common user-facing functions that fall into the following categories:

  • Annotation functions
  • Read/write functions
  • Data functions
  • Syntactic naming functions
  • Math and science functions

There are additional function families whose functionality lies outside the scope of this vignette. Consult the basejump website or package documentation for more information on these functions, which are more technical in nature and predominantly developer-facing:

  • R Markdown functions
  • Atomic vector functions
  • Coercion methods
  • Developer functions
  • Assert check functions

The S4 object system

The basejump package defines generics and methods for object-oriented programming with the S4 object system, which is used extensively by the Bioconductor project. These resources describe the S4 object system in detail:

In R, you can check whether a function is using the S4 object system with the isS4() function. Additionally, showMethods() and getMethod() are useful for exploring source code of S4 methods. In contrast, to obtain information on functions using the S3 object system, use methods() instead.
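As a minimal sketch of these introspection helpers, the following defines a throwaway S4 generic and inspects it. The describe() generic is a hypothetical name used only for this illustration and is not part of basejump.

```r
library(methods)

# Hypothetical generic and method, defined only for this illustration.
setGeneric("describe", function(object, ...) standardGeneric("describe"))
setMethod("describe", signature("numeric"), function(object, ...) {
    paste("numeric vector of length", length(object))
})

print(isS4(describe))   # TRUE: S4 generics are S4 objects
print(isS4(print))      # FALSE: print() is an S3 generic
showMethods("describe") # list the signatures that have methods defined
print(getMethod("describe", signature = "numeric")) # show the method source
print(head(methods("print"))) # S3 methods are listed with methods() instead
```

The same calls work on generics exported by any attached package; only the generic name changes.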

Annotation functions

Obtaining versioned genome annotations quickly and reliably from online databases such as Ensembl (T. Hubbard et al. 2002), GENCODE (Harrow et al. 2012), RefSeq (Pruitt, Tatusova, and Maglott 2007), and the UCSC Genome Browser (Kent et al. 2002) remains challenging. basejump addresses this by providing native gene- and transcript-level annotation support through AnnotationHub and ensembldb, along with a robust set of tools for parsing GTF and GFF annotation files. Annotation file parsing is handled internally by the rtracklayer package (Lawrence, Gentleman, and Carey 2009). Genome annotations are returned as GRanges class objects, defined in the GenomicRanges package (Lawrence et al. 2013), which provide easy access to position, chromosome, strand, and additional metadata accessible with mcols(). A GRanges object can be coerced to a standard data.frame with as.data.frame().

Ensembl annotations

makeGRangesFromEnsembl() supports multiple genome builds and versioned releases from Ensembl.

Current release

By default, the function will use the latest release version available from AnnotationHub and ensembldb.

grch38 <- makeGRangesFromEnsembl(organism = "Homo sapiens")
summary(grch38)
## [1] "GRanges object with 65774 ranges and 8 metadata columns"
head(names(grch38))
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
names(metadata(grch38))
## [1] "package"        "version"        "date"           "organism"      
## [5] "genomeBuild"    "ensemblRelease" "ensembldb"      "level"         
## [9] "id"

Data inside a GRanges object can be accessed with a number of functions defined in the GenomeInfoDb and IRanges packages.

seqnames(grch38)
## factor-Rle of length 65774 with 49163 runs
##   Lengths:      2      1      4      3 ...      1      1      1      2
##   Values :      X     20      1      6 ...     10     15     22     11
## Levels(319): X 20 1 6 3 7 ... LRG_239 LRG_311 LRG_721 LRG_741 LRG_93
seqinfo(grch38)
## Seqinfo object with 319 sequences from GRCh38 genome:
##   seqnames seqlengths isCircular genome
##   X         156040895      FALSE GRCh38
##   20         64444167      FALSE GRCh38
##   1         248956422      FALSE GRCh38
##   6         170805979      FALSE GRCh38
##   3         198295559      FALSE GRCh38
##   ...             ...        ...    ...
##   LRG_239      114904      FALSE GRCh38
##   LRG_311      115492      FALSE GRCh38
##   LRG_721       33396      FALSE GRCh38
##   LRG_741      231167      FALSE GRCh38
##   LRG_93        22459      FALSE GRCh38
ranges(grch38)
## IRanges object with 65774 ranges and 0 metadata columns:
##                       start       end     width
##                   <integer> <integer> <integer>
##   ENSG00000000003 100627109 100639991     12883
##   ENSG00000000005 100584802 100599885     15084
##   ENSG00000000419  50934867  50958555     23689
##   ENSG00000000457 169849631 169894267     44637
##   ENSG00000000460 169662007 169854080    192074
##               ...       ...       ...       ...
##            LRG_94  70597348  70602775      5428
##            LRG_96  55203594  55289803     86210
##            LRG_97  37225270  37244265     18996
##            LRG_98  36568013  36579762     11750
##            LRG_99  36591943  36598262      6320
strand(grch38)
## factor-Rle of length 65774 with 32627 runs
##   Lengths: 1 1 2 1 1 1 2 1 1 1 1 4 3 1 1 ... 1 1 2 1 1 1 7 3 6 1 1 1 3 1 1
##   Values : - + - + - + - + - + - + - + - ... - + - + - + - + - + - + - + -
## Levels(3): + - *

The GRanges class is a powerful container for genomic coordinates and metadata. For example, we can easily tabulate the gene counts per chromosome.

grch38 %>%
    seqnames() %>%
    table() %>%
    sort(decreasing = TRUE) %>%
    head(n = 24) %>%
    .[sort(names(.))]
## .
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22 
## 5457 2329 3432 3072 1392 2306 2262 2623 3148 1216 3025 4193 1449  893 1399 
##    3    4    5    6    7    8    9    X    Y 
## 3189 2643 3016 3021 3000 2455 2375 2507  544
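The tally, sort, and subset pattern used above can be sketched in base R on a toy vector. The chr values below are made up for illustration and stand in for the chromosome names of a real GRanges object.

```r
# Toy vector of chromosome names (made-up values, one entry per gene).
chr <- c("1", "X", "2", "1", "10", "2", "1", "MT")

counts <- table(chr)                                # genes per chromosome
top <- head(sort(counts, decreasing = TRUE), n = 2) # keep the two largest
top <- top[sort(names(top))]                        # reorder by name
print(top)
```

Sorting by name is alphabetical rather than numeric, which is why "10" precedes "2" in the per-chromosome output above.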

Gene metadata is accessed with mcols(), which returns a DataFrame. Here's the current list of gene-level annotation columns returned from makeGRangesFromEnsembl():

mcols(grch38) %>% colnames()
## [1] "broadClass"     "description"    "entrezID"       "geneBiotype"   
## [5] "geneID"         "geneIDVersion"  "geneName"       "seqCoordSystem"
mcols(grch38) %>% lapply(X = ., FUN = head)
## $broadClass
## factor-Rle of length 6 with 1 run
##   Lengths:      6
##   Values : coding
## Levels(8): coding ig mito noncoding other pseudo small tcr
## 
## $description
## factor-Rle of length 6 with 6 runs
##   Lengths:                                                              ...
##   Values :                                                          tetr...
## Levels(35744): 1-acylglycerol-3-phosphate O-acyltransferase 1 [Source:HG...
## 
## $entrezID
## $entrezID$ENSG00000000003
## [1] 7105
## 
## $entrezID$ENSG00000000005
## [1] 64102
## 
## $entrezID$ENSG00000000419
## [1] 8813
## 
## $entrezID$ENSG00000000457
## [1] 57147
## 
## $entrezID$ENSG00000000460
## [1] 55732
## 
## $entrezID$ENSG00000000938
## [1] 2268
## 
## 
## $geneBiotype
## factor-Rle of length 6 with 1 run
##   Lengths:              6
##   Values : protein_coding
## Levels(45): 3prime_overlapping_ncrna antisense ... vaultRNA
## 
## $geneID
## character-Rle of length 6 with 6 runs
##   Lengths:                 1                 1 ...                 1
##   Values : "ENSG00000000003" "ENSG00000000005" ... "ENSG00000000938"
## 
## $geneIDVersion
## character-Rle of length 6 with 6 runs
##   Lengths:                    1 ...                    1
##   Values : "ENSG00000000003.13" ... "ENSG00000000938.11"
## 
## $geneName
## factor-Rle of length 6 with 6 runs
##   Lengths:        1        1        1        1        1        1
##   Values :   TSPAN6     TNMD     DPM1    SCYL3 C1orf112      FGR
## Levels(59074): 5_8S_rRNA 5S_rRNA 7SK A1BG ... ZYG11B ZYX ZZEF1 ZZZ3
## 
## $seqCoordSystem
## factor-Rle of length 6 with 1 run
##   Lengths:          6
##   Values : chromosome
## Levels(3): chromosome lrg scaffold

Transcript-level annotations are also supported, using the level = "transcripts" argument.

makeGRangesFromEnsembl(organism = "Homo sapiens", level = "transcripts")
## GRanges object with 214285 ranges and 14 metadata columns:
##                   seqnames              ranges strand | broadClass
##                      <Rle>           <IRanges>  <Rle> |      <Rle>
##   ENST00000000233        7 127588345-127591705      + |     coding
##   ENST00000000412       12     8940365-8949955      - |     coding
##   ENST00000000442       11   64305578-64316738      + |     coding
##   ENST00000001008       12     2794953-2805423      + |     coding
##   ENST00000001146        2   72129238-72148038      - |     coding
##               ...      ...                 ...    ... .        ...
##          LRG_94t1       10   70597348-70602775      - |      other
##          LRG_96t1       15   55203594-55289803      - |      other
##          LRG_97t1       22   37225270-37244265      - |      other
##          LRG_98t1       11   36568013-36579762      + |      other
##          LRG_99t1       11   36591943-36598262      - |      other
##                                                                                                                                  description
##                                                                                                                                        <Rle>
##   ENST00000000233                                                                ADP-ribosylation factor 5 [Source:HGNC Symbol;Acc:HGNC:658]
##   ENST00000000412                                         mannose-6-phosphate receptor (cation dependent) [Source:HGNC Symbol;Acc:HGNC:6752]
##   ENST00000000442                                                         estrogen-related receptor alpha [Source:HGNC Symbol;Acc:HGNC:3471]
##   ENST00000001008                                                          FK506 binding protein 4, 59kDa [Source:HGNC Symbol;Acc:HGNC:3720]
##   ENST00000001146                                 cytochrome P450, family 26, subfamily B, polypeptide 1 [Source:HGNC Symbol;Acc:HGNC:20581]
##               ...                                                                                                                        ...
##          LRG_94t1                                                       perforin 1 (pore forming protein) [Source:HGNC Symbol;Acc:HGNC:9360]
##          LRG_96t1                                                      RAB27A, member RAS oncogene family [Source:HGNC Symbol;Acc:HGNC:9766]
##          LRG_97t1 ras-related C3 botulinum toxin substrate 2 (rho family, small GTP binding protein Rac2) [Source:HGNC Symbol;Acc:HGNC:9802]
##          LRG_98t1                                                         recombination activating gene 1 [Source:HGNC Symbol;Acc:HGNC:9831]
##          LRG_99t1                                                         recombination activating gene 2 [Source:HGNC Symbol;Acc:HGNC:9832]
##                   entrezID    geneBiotype          geneID
##                     <list>          <Rle>           <Rle>
##   ENST00000000233      381 protein_coding ENSG00000004059
##   ENST00000000412     4074 protein_coding ENSG00000003056
##   ENST00000000442     2101 protein_coding ENSG00000173153
##   ENST00000001008     2288 protein_coding ENSG00000004478
##   ENST00000001146    56603 protein_coding ENSG00000003137
##               ...      ...            ...             ...
##          LRG_94t1     5551       LRG_gene          LRG_94
##          LRG_96t1     5873       LRG_gene          LRG_96
##          LRG_97t1     5880       LRG_gene          LRG_97
##          LRG_98t1     5896       LRG_gene          LRG_98
##          LRG_99t1     5897       LRG_gene          LRG_99
##                        geneIDVersion geneName seqCoordSystem
##                                <Rle>    <Rle>          <Rle>
##   ENST00000000233  ENSG00000004059.9     ARF5     chromosome
##   ENST00000000412  ENSG00000003056.6     M6PR     chromosome
##   ENST00000000442 ENSG00000173153.12    ESRRA     chromosome
##   ENST00000001008  ENSG00000004478.7    FKBP4     chromosome
##   ENST00000001146  ENSG00000003137.7  CYP26B1     chromosome
##               ...                ...      ...            ...
##          LRG_94t1           LRG_94.1     PRF1     chromosome
##          LRG_96t1           LRG_96.1   RAB27A     chromosome
##          LRG_97t1           LRG_97.1     RAC2     chromosome
##          LRG_98t1           LRG_98.1     RAG1     chromosome
##          LRG_99t1           LRG_99.1     RAG2     chromosome
##                   transcriptBiotype transcriptCdsSeqEnd
##                               <Rle>               <Rle>
##   ENST00000000233    protein_coding           127591299
##   ENST00000000412    protein_coding             8946404
##   ENST00000000442    protein_coding            64315966
##   ENST00000001008    protein_coding             2803258
##   ENST00000001146    protein_coding            72147834
##               ...               ...                 ...
##          LRG_94t1          LRG_gene            70600902
##          LRG_96t1          LRG_gene            55234934
##          LRG_97t1          LRG_gene            37244148
##          LRG_98t1          LRG_gene            36576436
##          LRG_99t1          LRG_gene            36594168
##                   transcriptCdsSeqStart    transcriptID
##                                   <Rle>           <Rle>
##   ENST00000000233             127588499 ENST00000000233
##   ENST00000000412               8941818 ENST00000000412
##   ENST00000000442              64307180 ENST00000000442
##   ENST00000001008               2795140 ENST00000001008
##   ENST00000001146              72132227 ENST00000001146
##               ...                   ...             ...
##          LRG_94t1              70598053        LRG_94t1
##          LRG_96t1              55205507        LRG_96t1
##          LRG_97t1              37226673        LRG_97t1
##          LRG_98t1              36573305        LRG_98t1
##          LRG_99t1              36592585        LRG_99t1
##                   transcriptIDVersion  transcriptName
##                                 <Rle>           <Rle>
##   ENST00000000233   ENST00000000233.8 ENST00000000233
##   ENST00000000412   ENST00000000412.6 ENST00000000412
##   ENST00000000442   ENST00000000442.9 ENST00000000442
##   ENST00000001008   ENST00000001008.5 ENST00000001008
##   ENST00000001146   ENST00000001146.5 ENST00000001146
##               ...                 ...             ...
##          LRG_94t1          LRG_94t1.1        LRG_94t1
##          LRG_96t1          LRG_96t1.1        LRG_96t1
##          LRG_97t1          LRG_97t1.1        LRG_97t1
##          LRG_98t1          LRG_98t1.1        LRG_98t1
##          LRG_99t1          LRG_99t1.1        LRG_99t1
##   -------
##   seqinfo: 319 sequences from GRCh38 genome

Versioned releases

Versioned releases are selected with the release argument, which currently reaches back to Ensembl 87. If an older Ensembl release is required, use a GTF/GFF file instead, or browse the Bioconductor website to see if a suitable annotation database package is available.

makeGRangesFromEnsembl(
    organism = "Homo sapiens",
    genomeBuild = "GRCh38",
    release = 90
)
## GRanges object with 64661 ranges and 8 metadata columns:
##                   seqnames              ranges strand | broadClass
##                      <Rle>           <IRanges>  <Rle> |      <Rle>
##   ENSG00000000003        X 100627109-100639991      - |     coding
##   ENSG00000000005        X 100584802-100599885      + |     coding
##   ENSG00000000419       20   50934867-50958555      - |     coding
##   ENSG00000000457        1 169849631-169894267      - |     coding
##   ENSG00000000460        1 169662007-169854080      + |     coding
##               ...      ...                 ...    ... .        ...
##           LRG_995        1   77948402-77979205      - |      other
##           LRG_996       12   56080025-56103507      + |      other
##           LRG_997        6 117288367-117425855      - |      other
##           LRG_998        6   41934933-42048894      - |      other
##           LRG_999       19   42268537-42295796      + |      other
##                                                                                                      description
##                                                                                                            <Rle>
##   ENSG00000000003                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
##   ENSG00000000005                                                tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
##   ENSG00000000419 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
##   ENSG00000000457                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
##   ENSG00000000460                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
##               ...                                                                                            ...
##           LRG_995                      far upstream element binding protein 1 [Source:HGNC Symbol;Acc:HGNC:4004]
##           LRG_996                           erb-b2 receptor tyrosine kinase 3 [Source:HGNC Symbol;Acc:HGNC:3431]
##           LRG_997             ROS proto-oncogene 1, receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:10261]
##           LRG_998                                                   cyclin D3 [Source:HGNC Symbol;Acc:HGNC:1585]
##           LRG_999                          capicua transcriptional repressor [Source:HGNC Symbol;Acc:HGNC:14214]
##                   entrezID    geneBiotype          geneID
##                     <list>          <Rle>           <Rle>
##   ENSG00000000003     7105 protein_coding ENSG00000000003
##   ENSG00000000005    64102 protein_coding ENSG00000000005
##   ENSG00000000419     8813 protein_coding ENSG00000000419
##   ENSG00000000457    57147 protein_coding ENSG00000000457
##   ENSG00000000460    55732 protein_coding ENSG00000000460
##               ...      ...            ...             ...
##           LRG_995     8880       LRG_gene         LRG_995
##           LRG_996     2065       LRG_gene         LRG_996
##           LRG_997     6098       LRG_gene         LRG_997
##           LRG_998      896       LRG_gene         LRG_998
##           LRG_999    23152       LRG_gene         LRG_999
##                        geneIDVersion geneName seqCoordSystem
##                                <Rle>    <Rle>          <Rle>
##   ENSG00000000003 ENSG00000000003.14   TSPAN6     chromosome
##   ENSG00000000005  ENSG00000000005.5     TNMD     chromosome
##   ENSG00000000419 ENSG00000000419.12     DPM1     chromosome
##   ENSG00000000457 ENSG00000000457.13    SCYL3     chromosome
##   ENSG00000000460 ENSG00000000460.16 C1orf112     chromosome
##               ...                ...      ...            ...
##           LRG_995          LRG_995.1    FUBP1     chromosome
##           LRG_996          LRG_996.1    ERBB3     chromosome
##           LRG_997          LRG_997.1     ROS1     chromosome
##           LRG_998          LRG_998.1    CCND3     chromosome
##           LRG_999          LRG_999.1      CIC     chromosome
##   -------
##   seqinfo: 388 sequences from GRCh38 genome

GRCh37

The legacy Ensembl GRCh37 genome build (release 75; also known as UCSC hg19) is also supported, but is no longer recommended for use in new analyses. Internally, support for GRCh37 is provided by the EnsDb.Hsapiens.v75 annotation database package.

makeGRangesFromEnsembl(
    organism = "Homo sapiens",
    genomeBuild = "GRCh37"
)
## GRanges object with 64102 ranges and 6 metadata columns:
##                   seqnames              ranges strand | broadClass
##                      <Rle>           <IRanges>  <Rle> |      <Rle>
##   ENSG00000000003        X   99883667-99894988      - |     coding
##   ENSG00000000005        X   99839799-99854882      + |     coding
##   ENSG00000000419       20   49551404-49575092      - |     coding
##   ENSG00000000457        1 169818772-169863408      - |     coding
##   ENSG00000000460        1 169631245-169823221      + |     coding
##               ...      ...                 ...    ... .        ...
##            LRG_94       10   72357104-72362531      - |      other
##            LRG_96       15   55495792-55582001      - |      other
##            LRG_97       22   37621310-37640305      - |      other
##            LRG_98       11   36589563-36601312      + |      other
##            LRG_99       11   36613493-36619812      - |      other
##                   entrezID    geneBiotype          geneID geneName
##                     <list>          <Rle>           <Rle>    <Rle>
##   ENSG00000000003     7105 protein_coding ENSG00000000003   TSPAN6
##   ENSG00000000005    64102 protein_coding ENSG00000000005     TNMD
##   ENSG00000000419     8813 protein_coding ENSG00000000419     DPM1
##   ENSG00000000457    57147 protein_coding ENSG00000000457    SCYL3
##   ENSG00000000460    55732 protein_coding ENSG00000000460 C1orf112
##               ...      ...            ...             ...      ...
##            LRG_94     5551       LRG_gene          LRG_94   LRG_94
##            LRG_96     5873       LRG_gene          LRG_96   LRG_96
##            LRG_97     5880       LRG_gene          LRG_97   LRG_97
##            LRG_98     5896       LRG_gene          LRG_98   LRG_98
##            LRG_99     5897       LRG_gene          LRG_99   LRG_99
##                   seqCoordSystem
##                            <Rle>
##   ENSG00000000003     chromosome
##   ENSG00000000005     chromosome
##   ENSG00000000419     chromosome
##   ENSG00000000457     chromosome
##   ENSG00000000460     chromosome
##               ...            ...
##            LRG_94     chromosome
##            LRG_96     chromosome
##            LRG_97     chromosome
##            LRG_98     chromosome
##            LRG_99     chromosome
##   -------
##   seqinfo: 273 sequences from GRCh37 genome

Gene mapping functions

Gene-to-symbol and transcript-to-gene mappings can be easily acquired with the makeGene2SymbolFromEnsembl() and makeTx2GeneFromEnsembl() functions. These return two-column Gene2Symbol and Tx2Gene objects, respectively.

makeGene2SymbolFromEnsembl(organism = "Homo sapiens")
## Gene2Symbol with 65774 rows and 2 columns
##                          geneID    geneName
##                     <character> <character>
## ENSG00000000003 ENSG00000000003      TSPAN6
## ENSG00000000005 ENSG00000000005        TNMD
## ENSG00000000419 ENSG00000000419        DPM1
## ENSG00000000457 ENSG00000000457       SCYL3
## ENSG00000000460 ENSG00000000460    C1orf112
## ...                         ...         ...
## LRG_94                   LRG_94      PRF1.1
## LRG_96                   LRG_96    RAB27A.1
## LRG_97                   LRG_97      RAC2.1
## LRG_98                   LRG_98      RAG1.1
## LRG_99                   LRG_99      RAG2.1
makeTx2GeneFromEnsembl(organism = "Homo sapiens")
## Tx2Gene with 214285 rows and 2 columns
##                    transcriptID          geneID
##                     <character>     <character>
## ENST00000000233 ENST00000000233 ENSG00000004059
## ENST00000000412 ENST00000000412 ENSG00000003056
## ENST00000000442 ENST00000000442 ENSG00000173153
## ENST00000001008 ENST00000001008 ENSG00000004478
## ENST00000001146 ENST00000001146 ENSG00000003137
## ...                         ...             ...
## LRG_94t1               LRG_94t1          LRG_94
## LRG_96t1               LRG_96t1          LRG_96
## LRG_97t1               LRG_97t1          LRG_97
## LRG_98t1               LRG_98t1          LRG_98
## LRG_99t1               LRG_99t1          LRG_99
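One common use of a transcript-to-gene mapping is collapsing transcript-level values to gene level. The sketch below does this in base R with a plain data.frame; the counts are made up, and "ENST_hypothetical" is an invented identifier added only so that one gene has two transcripts.

```r
# Toy mapping: the first two rows are copied from the output above; the third
# transcript ID is hypothetical.
tx2gene <- data.frame(
    transcriptID = c("ENST00000000233", "ENST00000000412", "ENST_hypothetical"),
    geneID = c("ENSG00000004059", "ENSG00000003056", "ENSG00000004059"),
    stringsAsFactors = FALSE
)

# Made-up transcript-level counts for two samples (filled column by column).
counts <- matrix(
    c(10, 5, 2, 20, 0, 4),
    nrow = 3,
    dimnames = list(tx2gene$transcriptID, c("sample1", "sample2"))
)

# Look up the gene for each transcript row, then sum rows sharing a gene.
genes <- tx2gene$geneID[match(rownames(counts), tx2gene$transcriptID)]
geneCounts <- rowsum(counts, group = genes)
print(geneCounts)
```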

Parse GTF/GFF files

For organisms that are not well supported on Ensembl, we recommend using GTF or GFF files for gene annotations. GTF files are generally more consistently standardized, so prefer them over GFF3 when both are available. makeTx2GeneFromGFF() and makeGene2SymbolFromGFF() are utility functions that return transcript-to-gene and gene-to-symbol mappings from the genomic ranges returned by makeGRangesFromGFF().

GTF

makeGRangesFromGFF(
    file = pasteURL(
        "ftp.ensembl.org",
        "pub",
        "release-92",
        "gtf",
        "homo_sapiens",
        "Homo_sapiens.GRCh38.92.gtf.gz",
        protocol = "ftp"
    )
)
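To make the target of the call above explicit, here is a base R sketch of the URL being assembled. It assumes that pasteURL() joins its arguments with "/" and prepends the protocol; that behavior is inferred from the call, not taken from the package source.

```r
# Assumption: path components are joined with "/" and prefixed by "ftp://".
parts <- c(
    "ftp.ensembl.org",
    "pub",
    "release-92",
    "gtf",
    "homo_sapiens",
    "Homo_sapiens.GRCh38.92.gtf.gz"
)
url <- paste0("ftp://", paste(parts, collapse = "/"))
print(url)
```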

GFF3

makeGRangesFromGFF(
    file = pasteURL(
        "ftp.ensembl.org",
        "pub",
        "release-92",
        "gff3",
        "homo_sapiens",
        "Homo_sapiens.GRCh38.92.gff3.gz",
        protocol = "ftp"
    )
)

Gene interconversion functions

Mapping gene identifiers to gene names (symbols) in a non-destructive manner remains challenging for the bioinformatics community. basejump provides convertGenesToSymbols() and convertSymbolsToGenes() generics with methods for SummarizedExperiment, making interconversion between gene IDs and gene names easier. Internally, these mappings are handled by the geneID and geneName mcols columns inside the rowRanges slot.

rse_symbols <- convertGenesToSymbols(rse)
head(rownames(rse_symbols))
## [1] "TSPAN6"   "TNMD"     "DPM1"     "SCYL3"    "C1orf112" "FGR"
rse_genes <- convertSymbolsToGenes(rse_symbols)
head(rownames(rse_genes))
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"

To convert from transcript- to gene-level easily, the convertTranscriptsToGenes() and stripTranscriptVersions() functions are also provided (not shown).
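The two ideas in this section, reversible symbol mapping and version stripping, can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it assumes duplicated symbols are disambiguated with make.unique() (consistent with names such as "PRF1.1" in the gene-to-symbol output earlier) and that a version is a trailing ".<digits>" suffix. "ENSG_hypothetical" is an invented identifier.

```r
# Toy mapping in which two gene IDs share the symbol "PRF1".
gene2symbol <- data.frame(
    geneID = c("ENSG00000000003", "ENSG00000000005", "ENSG_hypothetical", "LRG_94"),
    geneName = c("TSPAN6", "TNMD", "PRF1", "PRF1"),
    stringsAsFactors = FALSE
)

# Make symbols unique so the mapping stays one-to-one and reversible.
symbols <- make.unique(gene2symbol$geneName)
names(symbols) <- gene2symbol$geneID
print(symbols)

# Reverse lookup: unique symbol back to gene ID.
ids <- setNames(names(symbols), symbols)
print(ids[["PRF1.1"]])

# Strip a trailing version suffix from versioned identifiers.
versioned <- c("ENST00000000233.8", "LRG_94t1.1")
stripped <- sub("\\.[0-9]+$", "", versioned)
print(stripped)
```

Without the uniqueness step, converting symbols back to gene IDs would be ambiguous whenever two genes share a name.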

Query external annotation databases

In addition to interfacing with Ensembl, basejump exports a number of utility functions to interface with other genomic databases, including:

  • EggNOG (Huerta-Cepas et al. 2016).
  • HGNC (Yates et al. 2017).
  • MGI (C. L. Smith et al. 2018).
  • PANTHER (Mi et al. 2017).

HGNC

hgnc2gene <- HGNC2Ensembl()
summary(hgnc2gene)
##       Length        Class         Mode 
##            2 HGNC2Ensembl           S4

MGI

mgi2ensembl <- MGI2Ensembl()
summary(mgi2ensembl)
##      Length       Class        Mode 
##           2 MGI2Ensembl          S4

EggNOG

eggnog <- EggNOG()
summary(eggnog)
## Length  Class   Mode 
##      2 EggNOG     S4

PANTHER

panther <- PANTHER(organism = "Homo sapiens")
summary(panther)
##  Length   Class    Mode 
##       9 PANTHER      S4

WormBase

Note that for Caenorhabditis elegans genome annotations, we recommend additional use of our specialized wormbase package, which queries the WormBase database (Stein et al. 2001) directly, and provides support of versioned releases (e.g. WS265) that have additional metadata not currently available on Ensembl.

Import/export functions

This section is still under development.

Data functions

This section is still under development.

Sanitization functions

The makeNames suite of functions, consisting primarily of camel(), dotted(), snake(), and upperCamel(), provides S4 method support for sanitizing common data structures used in genomics research and on Bioconductor.

To see how these functions work, let’s load up an example list named mn (short for makeNames).

loadRemoteData(url = file.path(basejumpTestsURL, "mn.rda"))
class(mn)
## [1] "list"
names(mn)
## [1] "character"      "namedCharacter" "factor"         "dataFrame"     
## [5] "matrix"         "tibble"         "list"

Atomic vectors

Character

x <- mn$character
print(x)
##  [1] "hello world"     "HELLO WORLD"     "RNAi clones"    
##  [4] "nCount"          "tx2gene"         "TX2GeneID"      
##  [7] "G2M.Score"       "worfdbHTMLRemap" "Mazda RX4"      
## [10] "%GC"             "5prime"          "5'-3' bias"     
## [13] "123"             NA
camel(x)
##  [1] "helloWorld"      "helloWORLD"      "rnaiClones"     
##  [4] "nCount"          "tx2gene"         "tx2GeneID"      
##  [7] "g2mScore"        "worfdbHTMLRemap" "mazdaRX4"       
## [10] "percentGC"       "x5prime"         "x5x3Bias"       
## [13] "x123"            NA
dotted(x)
##  [1] "hello.world"       "HELLO.WORLD"       "RNAI.clones"      
##  [4] "n.Count"           "tx2gene"           "TX2.Gene.ID"      
##  [7] "G2M.Score"         "worfdb.HTML.Remap" "Mazda.RX4"        
## [10] "percent.GC"        "X5prime"           "X5.3.bias"        
## [13] "X123"              NA
snake(x)
##  [1] "hello_world"       "hello_world"       "rnai_clones"      
##  [4] "n_count"           "tx2gene"           "tx2_gene_id"      
##  [7] "g2m_score"         "worfdb_html_remap" "mazda_rx4"        
## [10] "percent_gc"        "x5prime"           "x5_3_bias"        
## [13] "x123"              NA
upperCamel(x)
##  [1] "HelloWorld"      "HELLOWORLD"      "RNAIClones"     
##  [4] "NCount"          "Tx2gene"         "TX2GeneID"      
##  [7] "G2MScore"        "WorfdbHTMLRemap" "MazdaRX4"       
## [10] "PercentGC"       "X5prime"         "X5X3Bias"       
## [13] "X123"            NA
makeNames(x)
##  [1] "hello_world"     "HELLO_WORLD"     "RNAi_clones"    
##  [4] "nCount"          "tx2gene"         "TX2GeneID"      
##  [7] "G2M_Score"       "worfdbHTMLRemap" "Mazda_RX4"      
## [10] "X_GC"            "X5prime"         "X5__3__bias"    
## [13] "X123"            "NA_"
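The makeNames() output above is consistent with applying base R's make.names() and then replacing dots with underscores. The sketch below shows that idea on a few of the same inputs; it is an assumption about the behavior, not the package's source, and sanitize() is a hypothetical helper name.

```r
# Hypothetical helper: syntactically valid names with "_" instead of ".".
sanitize <- function(x) {
    gsub(".", "_", make.names(x), fixed = TRUE)
}

x <- c("hello world", "G2M.Score", "5prime", "123")
print(make.names(x)) # invalid characters become ".", leading digits get "X"
print(sanitize(x))
```

The case-changing variants (camel(), snake(), and so on) additionally split and recase words, which this sketch does not attempt.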

Named character

x <- mn$namedCharacter
print(x)
##        Item.A        Item.B 
## "hello world" "HELLO WORLD"
camel(x)
##        itemA        itemB 
## "helloWorld" "helloWORLD"
dotted(x)
##        Item.A        Item.B 
## "hello.world" "HELLO.WORLD"
snake(x)
##        item_a        item_b 
## "hello_world" "hello_world"
upperCamel(x)
##        ItemA        ItemB 
## "HelloWorld" "HELLOWORLD"
makeNames(x)
## [1] "hello_world" "HELLO_WORLD"

Factor

x <- mn$factor
print(x)
## sample 1 sample 2 sample 3 sample 4 
##  group 1  group 1  group 2  group 2 
## Levels: group 1 group 2
camel(x)
## sample1 sample2 sample3 sample4 
##  group1  group1  group2  group2 
## Levels: group1 group2
dotted(x)
## sample.1 sample.2 sample.3 sample.4 
##  group.1  group.1  group.2  group.2 
## Levels: group.1 group.2
snake(x)
## sample_1 sample_2 sample_3 sample_4 
##  group_1  group_1  group_2  group_2 
## Levels: group_1 group_2
upperCamel(x)
## Sample1 Sample2 Sample3 Sample4 
##  Group1  Group1  Group2  Group2 
## Levels: Group1 Group2
makeNames(x)
## [1] "group_1"   "group_1_1" "group_2"   "group_2_1"

Data frames

x <- datasets::USArrests
dimnames(x)
## [[1]]
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"       
## 
## [[2]]
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
camel(x, rownames = TRUE, colnames = TRUE) %>% dimnames()
## [[1]]
##  [1] "alabama"       "alaska"        "arizona"       "arkansas"     
##  [5] "california"    "colorado"      "connecticut"   "delaware"     
##  [9] "florida"       "georgia"       "hawaii"        "idaho"        
## [13] "illinois"      "indiana"       "iowa"          "kansas"       
## [17] "kentucky"      "louisiana"     "maine"         "maryland"     
## [21] "massachusetts" "michigan"      "minnesota"     "mississippi"  
## [25] "missouri"      "montana"       "nebraska"      "nevada"       
## [29] "newHampshire"  "newJersey"     "newMexico"     "newYork"      
## [33] "northCarolina" "northDakota"   "ohio"          "oklahoma"     
## [37] "oregon"        "pennsylvania"  "rhodeIsland"   "southCarolina"
## [41] "southDakota"   "tennessee"     "texas"         "utah"         
## [45] "vermont"       "virginia"      "washington"    "westVirginia" 
## [49] "wisconsin"     "wyoming"      
## 
## [[2]]
## [1] "murder"   "assault"  "urbanPop" "rape"
dotted(x, rownames = TRUE, colnames = TRUE) %>% dimnames()
## [[1]]
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New.Hampshire"  "New.Jersey"     "New.Mexico"     "New.York"      
## [33] "North.Carolina" "North.Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode.Island"   "South.Carolina"
## [41] "South.Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West.Virginia" 
## [49] "Wisconsin"      "Wyoming"       
## 
## [[2]]
## [1] "Murder"    "Assault"   "Urban.Pop" "Rape"
snake(x, rownames = TRUE, colnames = TRUE) %>% dimnames()
## [[1]]
##  [1] "alabama"        "alaska"         "arizona"        "arkansas"      
##  [5] "california"     "colorado"       "connecticut"    "delaware"      
##  [9] "florida"        "georgia"        "hawaii"         "idaho"         
## [13] "illinois"       "indiana"        "iowa"           "kansas"        
## [17] "kentucky"       "louisiana"      "maine"          "maryland"      
## [21] "massachusetts"  "michigan"       "minnesota"      "mississippi"   
## [25] "missouri"       "montana"        "nebraska"       "nevada"        
## [29] "new_hampshire"  "new_jersey"     "new_mexico"     "new_york"      
## [33] "north_carolina" "north_dakota"   "ohio"           "oklahoma"      
## [37] "oregon"         "pennsylvania"   "rhode_island"   "south_carolina"
## [41] "south_dakota"   "tennessee"      "texas"          "utah"          
## [45] "vermont"        "virginia"       "washington"     "west_virginia" 
## [49] "wisconsin"      "wyoming"       
## 
## [[2]]
## [1] "murder"    "assault"   "urban_pop" "rape"
upperCamel(x, rownames = TRUE, colnames = TRUE) %>% dimnames()
## [[1]]
##  [1] "Alabama"       "Alaska"        "Arizona"       "Arkansas"     
##  [5] "California"    "Colorado"      "Connecticut"   "Delaware"     
##  [9] "Florida"       "Georgia"       "Hawaii"        "Idaho"        
## [13] "Illinois"      "Indiana"       "Iowa"          "Kansas"       
## [17] "Kentucky"      "Louisiana"     "Maine"         "Maryland"     
## [21] "Massachusetts" "Michigan"      "Minnesota"     "Mississippi"  
## [25] "Missouri"      "Montana"       "Nebraska"      "Nevada"       
## [29] "NewHampshire"  "NewJersey"     "NewMexico"     "NewYork"      
## [33] "NorthCarolina" "NorthDakota"   "Ohio"          "Oklahoma"     
## [37] "Oregon"        "Pennsylvania"  "RhodeIsland"   "SouthCarolina"
## [41] "SouthDakota"   "Tennessee"     "Texas"         "Utah"         
## [45] "Vermont"       "Virginia"      "Washington"    "WestVirginia" 
## [49] "Wisconsin"     "Wyoming"      
## 
## [[2]]
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
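
For data frames, the rownames and colnames arguments control which dimension names are sanitized. As a rough illustration of the snake case transformation shown above, the following base R sketch renames both dimensions of USArrests. The helper toSnake() is hypothetical and written only for this example; it is not part of basejump and does not cover the edge cases that snake() handles.

```r
# Hypothetical helper using base R only; not how basejump implements snake().
toSnake <- function(x) {
    # Insert "_" at lower-to-upper camelCase boundaries ("UrbanPop").
    x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
    # Collapse runs of non-alphanumeric characters ("New Hampshire") to "_".
    x <- gsub("[^A-Za-z0-9]+", "_", x)
    tolower(x)
}

df <- datasets::USArrests
rownames(df) <- toSnake(rownames(df))
colnames(df) <- toSnake(colnames(df))

print(colnames(df))
print(rownames(df)[29:32])
```

Only the dimension names change; the values stored in the data frame are left as they were.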

Lists

x <- mn$list
print(x)
## $Item.A
## [1] 1 2
## 
## $Item.B
## [1] 3 4
camel(x) %>% names()
## [1] "itemA" "itemB"
dotted(x) %>% names()
## [1] "Item.A" "Item.B"
snake(x) %>% names()
## [1] "item_a" "item_b"
upperCamel(x) %>% names()
## [1] "ItemA" "ItemB"

R session information

utils::sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
##  [3] BiocParallel_1.18.0         matrixStats_0.54.0         
##  [5] Biobase_2.44.0              GenomicRanges_1.36.0       
##  [7] GenomeInfoDb_1.20.0         IRanges_2.18.1             
##  [9] S4Vectors_0.22.0            BiocGenerics_0.30.0        
## [11] basejump_0.10.11            BiocStyle_2.12.0           
## 
## loaded via a namespace (and not attached):
##  [1] ProtGenerics_1.16.0           bitops_1.0-6                 
##  [3] fs_1.3.1                      bit64_0.9-7                  
##  [5] progress_1.2.2                httr_1.4.0                   
##  [7] rprojroot_1.3-2               syntactic_0.1.10             
##  [9] tools_3.6.1                   backports_1.1.4              
## [11] R6_2.4.0                      lazyeval_0.2.2               
## [13] DBI_1.0.0                     withr_2.1.2                  
## [15] prettyunits_1.0.2             tidyselect_0.2.5             
## [17] bit_1.1-14                    curl_3.3                     
## [19] compiler_3.6.1                cli_1.1.0                    
## [21] xml2_1.2.0.9000               desc_1.2.0                   
## [23] rtracklayer_1.44.0            bookdown_0.12                
## [25] readr_1.3.1                   rappdirs_0.3.1               
## [27] pkgdown_1.3.0.9100            commonmark_1.7               
## [29] stringr_1.4.0                 digest_0.6.20                
## [31] Rsamtools_2.0.0               rmarkdown_1.14               
## [33] R.utils_2.9.0                 XVector_0.24.0               
## [35] pkgconfig_2.0.2               htmltools_0.3.6              
## [37] sessioninfo_1.1.1             ensembldb_2.8.0              
## [39] dbplyr_1.4.2                  rlang_0.4.0                  
## [41] RSQLite_2.1.1                 shiny_1.3.2                  
## [43] dplyr_0.8.3                   R.oo_1.22.0                  
## [45] RCurl_1.95-4.12               magrittr_1.5                 
## [47] GenomeInfoDbData_1.2.1        Matrix_1.2-17                
## [49] Rcpp_1.0.1                    R.methodsS3_1.7.1            
## [51] stringi_1.4.3                 yaml_2.2.0                   
## [53] MASS_7.3-51.4                 zlibbioc_1.30.0              
## [55] brio_0.2.4                    plyr_1.8.4                   
## [57] goalie_0.2.18                 BiocFileCache_1.8.0          
## [59] AnnotationHub_2.16.0          grid_3.6.1                   
## [61] blob_1.2.0                    transformer_0.1.12           
## [63] promises_1.0.1                crayon_1.3.4                 
## [65] lattice_0.20-38               Biostrings_2.52.0            
## [67] bioverbs_0.1.20               GenomicFeatures_1.36.4       
## [69] hms_0.5.0                     zeallot_0.1.0                
## [71] knitr_1.23                    pillar_1.4.2                 
## [73] reshape2_1.4.3                biomaRt_2.40.1               
## [75] XML_3.98-1.20                 glue_1.3.1                   
## [77] evaluate_0.14                 freerange_0.1.9              
## [79] data.table_1.12.2             BiocManager_1.30.4           
## [81] vctrs_0.2.0                   httpuv_1.5.1                 
## [83] tidyr_0.8.3                   grr_0.9.5                    
## [85] purrr_0.3.2                   assertthat_0.2.1             
## [87] xfun_0.8                      mime_0.7                     
## [89] xtable_1.8-4                  AnnotationFilter_1.8.0       
## [91] roxygen2_6.1.1                later_0.8.0                  
## [93] SingleCellExperiment_1.6.0    tibble_2.1.3                 
## [95] GenomicAlignments_1.20.1      Matrix.utils_0.9.7           
## [97] AnnotationDbi_1.46.0          memoise_1.1.0                
## [99] interactiveDisplayBase_1.22.0

References

The papers and software cited in our workflows are available as a shared library on Paperpile.

Harrow, Jennifer, Adam Frankish, Jose M Gonzalez, Electra Tapanari, Mark Diekhans, Felix Kokocinski, Bronwen L Aken, et al. 2012. “GENCODE: The Reference Human Genome Annotation for the ENCODE Project.” Genome Res. 22 (9) (September): 1760–1774. doi:10.1101/gr.135350.111. http://dx.doi.org/10.1101/gr.135350.111.

Hubbard, T, D Barker, E Birney, G Cameron, Y Chen, L Clark, T Cox, et al. 2002. “The Ensembl Genome Database Project.” Nucleic Acids Res. 30 (1) (January): 38–41. https://www.ncbi.nlm.nih.gov/pubmed/11752248.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2) (February): 115–121. doi:10.1038/nmeth.3252. http://dx.doi.org/10.1038/nmeth.3252.

Huerta-Cepas, Jaime, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide Heller, Mathias C Walter, Thomas Rattei, et al. 2016. “eggNOG 4.5: A Hierarchical Orthology Framework with Improved Functional Annotations for Eukaryotic, Prokaryotic and Viral Sequences.” Nucleic Acids Res. 44 (D1) (January): D286–93. doi:10.1093/nar/gkv1248. http://dx.doi.org/10.1093/nar/gkv1248.

Kent, W James, Charles W Sugnet, Terrence S Furey, Krishna M Roskin, Tom H Pringle, Alan M Zahler, and David Haussler. 2002. “The Human Genome Browser at UCSC.” Genome Res. 12 (6) (June): 996–1006. doi:10.1101/gr.229102. http://dx.doi.org/10.1101/gr.229102.

Lawrence, Michael, Robert Gentleman, and Vincent Carey. 2009. “rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25 (14) (July): 1841–1842. doi:10.1093/bioinformatics/btp328. http://dx.doi.org/10.1093/bioinformatics/btp328.

Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8) (August): e1003118. doi:10.1371/journal.pcbi.1003118. http://dx.doi.org/10.1371/journal.pcbi.1003118.

Mi, Huaiyu, Xiaosong Huang, Anushya Muruganujan, Haiming Tang, Caitlin Mills, Diane Kang, and Paul D Thomas. 2017. “PANTHER Version 11: Expanded Annotation Data from Gene Ontology and Reactome Pathways, and Data Analysis Tool Enhancements.” Nucleic Acids Res. 45 (D1) (January): D183–D189. doi:10.1093/nar/gkw1138. http://dx.doi.org/10.1093/nar/gkw1138.

Pruitt, Kim D, Tatiana Tatusova, and Donna R Maglott. 2007. “NCBI Reference Sequences (RefSeq): a Curated Non-Redundant Sequence Database of Genomes, Transcripts and Proteins.” Nucleic Acids Res. 35 (Database issue) (January): D61–5. doi:10.1093/nar/gkl842. http://dx.doi.org/10.1093/nar/gkl842.

Smith, Cynthia L, Judith A Blake, James A Kadin, Joel E Richardson, Carol J Bult, and Mouse Genome Database Group. 2018. “Mouse Genome Database (MGD)-2018: Knowledgebase for the Laboratory Mouse.” Nucleic Acids Res. 46 (D1) (January): D836–D842. doi:10.1093/nar/gkx1006. http://dx.doi.org/10.1093/nar/gkx1006.

Stein, L, P Sternberg, R Durbin, J Thierry-Mieg, and J Spieth. 2001. “WormBase: Network Access to the Genome and Biology of Caenorhabditis elegans.” Nucleic Acids Res. 29 (1) (January): 82–86. doi:10.1093/nar/29.1.82. https://www.ncbi.nlm.nih.gov/pubmed/11125056.

Yates, Bethan, Bryony Braschi, Kristian A Gray, Ruth L Seal, Susan Tweedie, and Elspeth A Bruford. 2017. “Genenames.org: The HGNC and VGNC Resources in 2017.” Nucleic Acids Res. 45 (D1) (January): D619–D625. doi:10.1093/nar/gkw1033. http://dx.doi.org/10.1093/nar/gkw1033.