View
16
Download
0
Category
Preview:
Citation preview
CRSeek: a Python module for facilitating complicated CRISPR
design strategies
Will Dampier Corresp., 1, 2, 3 , Cheng-Han Chung 1, 2 , Neil T Sullivan 1, 2 , Andrew J Atkins 1, 2 , Michael R Nonnemacher 1, 2, 4 , Brian Wigdahl Corresp. 1, 2, 4
1 Department of Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, PA, United States
2 Center for Molecular Virology and Translational Neuroscience, Institute for Molecular & Medicine and Infectious Disease, Drexel University College of
Medicine, Philadelphia, PA, United States3 School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, PA, United States
4 Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, United States
Corresponding Authors: Will Dampier, Brian Wigdahl
Email address: wnd22@drexel.edu, bw45@drexel.edu
With the popularization of the CRISPR-Cas gene editing system there has been an
explosion of new techniques made possible by this versatile technology. However, the
computational field has lagged behind with a current lack of computational tools for
developing complicated CRISPR-Cas gene editing strategies. We present crseek, a Python
package that provides a consistent application programming interface (API) for multiple
cleavage prediction algorithms. Four popular cleavage prediction algorithms were
implemented and further adapted to work on draft-quality genomes. Furthermore, since
crseek mirrors the popular scikit-learn API, the package can be easily integrated as an
upstream processing module for facilitating further CRISPR-Cas machine learning research.
The package is fully integrated with the biopython package facilitating simple import,
export, and manipulation of sequences before and after gene editing. This manuscript
presents four common gene editing tasks that would be difficult with current tools but are
easily performed with the crseek package. We believe this package will help
bioinformaticians rapidly design complex CRISPR-Cas gene editing strategies and will be a
useful addition to the field.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
CRSeek: a Python module for facilitating1
complicated CRISPR design strategies2
Will Dampier1,2,3, Cheng-Han Chung1,2, Neil T. Sullivan1,2, Andrew3
Atkins1,2, Michael R. Nonnemacher1,2,4, and Brian Wigdahl1,2,44
1Department of Microbiology and Immunology, Drexel University College of Medicine,5
Philadelphia, PA, USA6
2Center for Molecular Virology and Translational Neuroscience, Institute for Molecular7
Medicine and Infectious Disease, Drexel University College of Medicine, Philadelphia,8
PA, USA9
3School of Biomedical Engineering, Science, and Health Systems, Drexel University,10
Philadelphia, Pennsylvania, USA11
4Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, USA12
Corresponding author:13
Will Dampier and Brian Wigdahl14
Email address: wnd22@drexel.edu, bw45@drexel.edu15
ABSTRACT16
With the popularization of the CRISPR-Cas gene editing system there has been an explosion of new
techniques made possible by this versatile technology. However, the computational field has lagged
behind with a current lack of computational tools for developing complicated CRISPR-Cas gene editing
strategies. We present crseek, a Python package that provides a consistent application programming
interface (API) for multiple cleavage prediction algorithms. Four popular cleavage prediction algorithms
were implemented and further adapted to work on draft-quality genomes. Furthermore, since crseek
mirrors the popular scikit-learn API, the package can be easily integrated as an upstream processing
module for facilitating further CRISPR-Cas machine learning research. The package is fully integrated
with the biopython package facilitating simple import, export, and manipulation of sequences before
and after gene editing. This manuscript presents four common gene editing tasks that would be difficult
with current tools but are easily performed with the crseek package. We believe this package will help
bioinformaticians rapidly design complex CRISPR-Cas gene editing strategies and will be a useful addition
to the field.
17
18
19
20
21
22
23
24
25
26
27
28
29
INTRODUCTION30
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) are a family of related genes that31
function as a defense system in prokaryotes against phages and plasmid DNA (Barrangou, 2015). By32
incorporating a small portion of the invading DNA into the prokaryotic genome, it can defend against33
subsequent attacks. In this system, CRISPR-associated (Cas) genes process these genomic repeats into a34
guide RNA (gRNA). One particular region of the gRNA, termed the spacer, directs Cas endonucleases35
to cleave the target DNA at a region complementary to the spacer on foreign DNA (Marraffini and36
Sontheimer, 2010). Recently, aspects of this system have been re-engineered for researchers to easily37
use this system as a highly specific endonuclease that can be directed to edit nearly any desired DNA38
sequence across the genome (Jinek et al., 2012; Cong et al., 2013).39
The CRISPR-Cas system has revolutionized the gene editing field (Doudna and Charpentier, 2014).40
It has democratized the field by lowering the difficulty of editing a specific genomic locus (Genscript,41
2016). Various applications of CRISPR-Cas systems have been developed with useful properties such42
as an inducible expression system (Cao et al., 2016), tissue specificity (Ablain et al., 2015), as well as43
advanced editing strategies (Kim et al., 2017). In practice, targeting a specific locus with any of these44
systems simply involves finding a protospacer adjacent motif (PAM) next to a unique 20 bp protospacer45
and performing basic molecular biology techniques (Genscript, 2016; Sternberg and Doudna, 2015).46
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Recent studies have elucidated the majority of the mechanism and binding site recognition of the47
CRISPR-Cas system. Early target search strategies involved using ad-hoc rule-based methods that48
excluded potential targets by defining either a total maximum of mismatches or distinguishing between49
seed and tail regions (Fu et al., 2013; Hsu et al., 2013; Wu et al., 2014). Current targeting methods50
employ a data-driven approach by using a dataset of experimentally determined targeting probabilities and51
extracting sophisticated rules to construct scoring algorithms. Hsu et al. (2013) created a position-specific52
binding matrix by testing 700 gRNAs engineered to have intended mismatches to their target. Doench53
et al. (2016) followed this work by creating an even larger dataset of ≥80,000 gRNAs targeting many54
positions across the same gene. Klein et al. (2018) further refined this method by developing a kinetic55
model of CRISPR-Cas binding and cleavage using these datasets and showed that a small number of56
parameters can explain these experimentally determined binding rules.57
There are numerous currently employed tools for designing gRNAs reviewed by Ding et al. (2016);58
Cui et al. (2018). All of these tools use the same basic strategy. First, the target DNA is searched for PAM59
sequences and the adjacent protospacers are extracted. Next, the potential spacers are searched against a60
reference database allowing for multiple mismatches and are scored for potential off-target sites using one61
of the many scoring algorithms described above. Finally, the protospacers are ranked by their potential62
off-target risk. These tools can be served either as a webserver and occasionally as a locally installable63
toolset.64
However, all of these tools lack many of the features needed from a bioinformatic developer perspec-65
tive.66
• Webtools often limit searching to model organisms or pre-built indices.67
• Most tools cannot search for non-standard Cas9 variants such as Cpf1, SaCas9, or other species-68
specific Cas9s.69
• Currently distributed software lacks unit-testing, one-command install of the tool and dependencies,70
continuous integration frameworks, or a documented API.71
These features are needed in order for bioinformaticians to design larger scale experiments or develop72
automation methods for more advanced gene editing strategies.73
In this manuscript we present crseek (https://github.com/DamLabResources/crseek), a Python74
library mirroring the popular scikit-learn application program interface (API) proposed in Buitinck et al.75
(2013). It provides both high- and low-level methods for designing CRISPR gene editing strategies.76
crseek provides an invaluable resource for computational biologists looking to employ CRISPR-Cas77
based gene editing techniques.78
METHODS79
This manuscript presents a collection of tools for the biomedical software developer looking to design80
advanced CRISPR-Cas gene-editing strategies. This toolset was built from a developer perspective to81
aid biologists with programming experience who are looking to perform gRNA design for non-standard82
CRISPR-Cas gene editing. The toolset, and its dependencies, are installable using the Ananconda83
environment system (Anaconda, 2016).84
Terminology85
Throughout the research field there is a great deal of variability in the nomeclature of the various parts of86
the CRISPR-Cas complex. Many of these nomeclatures do not conform to the Python PEP-8 standard87
(Van Rossum et al., 2001). For consistency we refer to the complex parts in the following ways:88
• The gRNA refers to the entire strand of RNA which complexes with Cas9 to form the CRISPR-Cas89
complex. This molecule is not directly represented in crseek.90
• The spacer refers to the 20 nucleotide region of the gRNA which is used for target matching by91
the CRISPR-Cas complex. This requires an RNA alphabet.92
• The pam refers to the, potentially degenerate, nucleotide recognition site of the particular CRISPR-93
Cas system. The PAM resides adjacent to the spacer and is required for binding. This requires a94
DNA alphabet.95
2/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
• The target refers to the 20 nucleotide region of the DNA which is complementary to the spacer.96
In order to accommodate multiple CRISPR-Cas systems the target includes the PAM recognition97
site. This requires a DNA alphabet.98
• The loci refers to a ≥20 nucleotide segment of the DNA which may contain potential targets.99
The crseek tool does not make a distinction between on-target sites, those where binding is intended,100
and off-target sites, those where binding is unintended; all (spacer, target) pairs are treated equally.101
This framing is useful for expressing bleeding-edge uses of CRISPR-Cas that exploit the promiscuity of102
SpCas9 to target heterogeneous populations such as HIV (Dampier et al., 2018) or modulate microbial103
communities (Gomaa et al., 2014).104
Implementation Strategy105
We have intentionally mirrored the scikit-learn API. We subclass the BaseEstimator class and over-106
load the fit, transform, predict, and predict proba methods. This allows us to seamlessly107
use all of the tools of the scikit-searn package such as cross-validation, normalization, and other prediction108
methods (Pedregosa et al., 2011). We have decomposed this methodology into three main tasks: Searching,109
Preprocessing, and Estimating. We use the biopython library (Cock et al., 2009) to enforce appropriate110
nucleotide alphabets as inputs and outputs of the various tools.111
Searching112
We define search as the act of locating potential target sites in loci. DNA segments can be any113
files, or directories of files, readable by Bio.SeqIO (fasta, genbank, etc), lists of Bio.SeqRecord114
objects, or np.arrays of characters. The crseek tool provides two strategies for this searching:115
exhaustive or mismatch based. In an exhaustive search, all positions are considered as potential targets.116
For mismatch searching, a wrapper was created around the popular cas-offinder library (Bae et al.,117
2014). This library allows for rapid mismatch searching, even employing any graphical processors present118
and properly installed. The utils.tile seqrecord and utils.cas offinder return arrays of119
(spacer, target) pairs for downstream analysis.120
Preprocessing121
Once targets have been found on the genomic loci they must be encoded as (spacer, target) pairs for122
downstream analysis. There are two main strategies for this encoding: mismatch versus one-hot encoding.123
Rule-based mismatching and the Hsu et al. (2013) strategies both give the same weight independent of124
the mismatch identity; as such this encoding is a binary vector of length 21 (20 for the spacer, 1 for125
pam matching). One-hot encoding is used for strategies such as Doench et al. (2016) which give different126
binding penalties based on the mismatch identity; for example an A:T mismatch may have a larger penalty127
than an A:C mismatch and these penalties change with position.128
These two strategies have been encapsulated into the preprocessing.MatchingTransformer129
and preprocessing.OneHotEncoder classes. These classes take pairs of (spacer, target)130
and return binary vectors for downstream processing. By implementing these as subclasses of the131
sklearn.BaseEstimator, one can use these classes in sklearn.Pipeline instances. Down-132
stream processing can either employ one of the pre-built estimator classes or used as preprocessing133
for other machine learning algorithms.134
Estimating135
Multiple pre-built algorithms were collected for determining the activity of any given (spacer, target)136
pair. These estimators input the binary vectors produced by the preprocessing modules. A flexible137
MismatchEstimator was built that allows one to specify the seed-length, number of seed or non-seed138
mismatches, and the PAM identity. These parameters can also be loaded from user-supplied yaml files to139
allow for the exploration of non-standard CRISPR-Cas systems. The penalty scoring strategy described140
by Hsu et al. (2013) was implemented as the MITEstimator and the Doench et al. (2016) method141
as the CFDEstimator. The newly proposed kinetic model developed by Klein et al. (2018) has also142
been implemented as the KineticEstimator. New estimators can be easily added by subclassing the143
SequenceBase abstract base class and overloading the relevant methods.144
3/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Additional Features145
In order to account for many of the idiosyncrasies of designing CRISPR-Cas tools, a few optional features146
were added. Degenerate bases can be used across all tools. These bases are assumed to be each relevant147
nucleotide in equal likelihoods and the penalty scores are scaled accordingly (e.g. an R is 50% likely to be148
an A or G). SAM/BAM files can be used as input in all search tools. This allows for search and estimation149
across variable populations. This is useful when targeting microbiomes, metagenomes, or highly mutable150
viral genomes.151
In vitro validation of Cas9 cleavage of mixed populations152
In order to evaluate whether the crseek module can be used for novel analyses, an in vitro cutting assay153
using mixed populations of sequences from HIV-1-infected individuals was adapted. Genomic DNA154
was extracted from PMBCs derived from five patients in the Drexel Medicine CNS AIDS Research and155
Eradication Study (CARES). The Long Terminal Repeat (LTR) region was amplified using a two-round156
PCR strategy as previously described (Li et al., 2011). Bulk PCR products were cloned into the pGL3157
basic expression vector as previously described (Li et al., 2015) obtaining one to six unique sequences per158
patient. Additionally, sequences representing the consensus LTRs from CXCR4- (X4) and CCR5-utilizing159
(R5) viruses constructed from previous analyses (Antell et al., 2016), were synthesized and cloned into the160
pGL3 vector by VectorBuilder (Cyagen Biosciences). The LTR-A gRNA (ATCAGATATCCACTGACCTT161
NGG) from Hu et al. (2014) was selected for analysis, purchased from IDT, and in vitro transcribed using162
the EnGen sgRNA synthesis procedure (NEB) as described by the manufacturer protocol. The gRNA163
was transcribed, purified, and concentrated using the RNA clean and concentrator procedure provided by164
Zymo Research.165
The in vitro digestion of patient LTRs was performed by following NEB in vitro digestion of DNA with166
Cas9 nuclease protocol with no modifications. The multiple unique sequences cloned into pGL3 vectors167
were added in equimolar ratios to mirror the biological context of HIV-1 infection. Cutting reactions168
were performed for 1 hour at 37°C followed by 65°C heat-inactivation for 5 min. BamHI restriction169
endonuclease digestion was performed to linearize any undigested plasmid and samples were cleaned170
using the Qiagen PCR cleanup procedure to ensure proper running on the High Sensitivity Bioanalyzer171
chip (Agilent) which was used to measure the sizes and molarities of the fragments.172
As the sizes reported by the Bioanalyzer have a known degree of noise, a mean shift clustering173
algorithm was used to identify the appropriate cutoffs to use when assigning a Bioanalyzer peak to a174
particular molecular fragment. We use a width of 500 bp due to the estimate of 10% error in the length175
estimation described in technical specifications provided by Agilent. The two fragment sizes resulting176
from the Cas9 in vitro digestion were calculated in a similar way. The calculation of cleavage efficiency,177
the mean of the molarity of two Cas9-digested fragments (designated as mol f 1 and mol f 2 in Equation178
1) was used as the molarity for the fragmented pGL3. The molarity of the total pGL3 was calculated by179
adding the molarity of fragmented pGL3 and the molarity of unfragmented pGL3 (designated as molun).180
The function of cleavage efficiency can be described as:181
cleavage =(mol f 1+mol f 2)/2
molun+(mol f 1+mol f 2)/2(1)
The processed data is shown in Table 1.182
RESULTS183
In order to showcase the many uses of the crseek module, this manuscript includes four simple and184
realistic CRISPR-Cas design tasks. These tasks would not be reasonably possible with currently available185
tools due to either the draft quality of the template genome or the complexity of the design.186
Task 1187
Creating positive controls for CRISPR library screens are critical to downstream validation. As an188
illustrative example, imagine developing a gRNA that can knockout the expression of eGFP from a189
plasmid (Addgene-54622, (Davidson, 2016)) transformed into the newly sequenced Clostridioides difficile190
genome (Genbank: NC 009089.1, Monot et al. (2011)). A robust positive control will have a perfect191
homology with the eGFP locus while having a low homology with any other position in the C. difficile192
genome. The following script shows a basic implementation.193
4/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
# Build the CFD estimator
estimator = estimators.CFDEstimator.build_pipeline()
# Load the relevant sequences
genome_path = 'data/Clostridioides_difficile_630.gb'
plasmid_path = 'data/addgene-plasmid-54622-sequence-158642.gbk'
with open(genome_path) as handle:
genome = list(SeqIO.parse(handle, 'genbank'))[0]
with open(plasmid_path) as handle:
plasmid = list(SeqIO.parse(handle, 'genbank'))[0]
egfp_feature = get_egfp(plasmid)
egfp_record = egfp_feature.extract(plasmid)
# Extract all possible NGG targets
possible_targets = utils.extract_possible_targets(egfp_record,
pams=('NGG',))
# Find all targets across the host genome
possible_binding = utils.cas_offinder(possible_targets,
5, locus = [genome])
# Score each hit across the genome
scores = estimator.predict_proba(possible_binding.values)
possible_binding['Score'] = scores
# Find the maximum score for each spacer
results = possible_binding.groupby('spacer')['Score'].agg('max')
results.sort_values(inplace=True)
Utilizing this simple script reveals that there are numerous potential spacers. The biopython194
GenomeDiagram module can be used to generate a proposed plasmid map as shown in Fig. 1. This is195
not an unexpected result as the C. difficile genome does not contain genes that share significant homology196
with eGFP, as such this represents simple application of the crseek module. Aside from the draft-quality197
of the genome, this design could be accomplished with many of the currently available tools.198
Task 2199
The crseek module is also capable of more complex spacer design tasks. The following code snippet200
describes the generation of a large library of spacers that target individuals genes in the C. difficile201
genome. The tight integration of crseek with the biopython package allows for simple reading and202
writing of the numerous file types employed in the biological field. Furthermore, with the crseek203
module it is simple to ensure that the spacers designed for a particular gene do not have homology204
elsewhere in the genome.205
library_grnas = []
gene_info = []
genome_key = genome.id + ' ' + genome.description
for feat in genome.features:
# Only target genes
if feat.type != 'CDS':
continue
# Get info about this gene
5/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
product = feat.qualifiers['product'][0]
tag = feat.qualifiers['locus_tag'][0]
gene_record = feat.extract(genome)
# Get potential spacers
possible_targets = utils.extract_possible_targets(gene_record)
# Score hits
possible_binding = utils.cas_offinder(possible_targets,
3, locus = [genome])
scores = estimator.predict_proba(possible_binding.values)
possible_binding['Score'] = scores
# Set hits within the gene to np.nan
slic = pd.IndexSlice[genome_key, :,
feat.location.start:feat.location.end]
possible_binding.loc[slic, 'Score'] = np.nan
# Aggregate and sort off-target scan
grouped_spacers = possible_binding.groupby('spacer')
ot_scores = grouped_spacers['Score'].agg('max')
# Spacers which only have intragenic hit
ot_scores.fillna(0, inplace=True)
ot_scores.sort_values(inplace=True)
# Save information
gene_info.append({'Product': product,
'Tag': tag,
'UsefulGuides': (ot_scores<=0.25).sum()})
top = ot_scores.head()
for protospacer, off_score in top.to_dict().items():
dna_target = protospacer.back_transcribe()
strand = '+'
if location == -1:
dna_target = reverse_complement(dna_target)
strand = '-'
location = genome.seq.find(dna_target)
library_grnas.append({'Product': product,
'Tag': tag,
'Protospacer': protospacer,
'Location': location,
'Strand': strand,
'Off Target Score': off_score})
gene_df = pd.DataFrame(gene_info)
library_df = pd.DataFrame(library_grnas)
This analysis shows that of the 3751 annotated genes in this draft C. difficile genome, 3506 (93.4%)206
of them can be targeted by at least five spacers without off-target effects. Fig. 2 shows the selected207
spacers along with the genes that are targeted. The design of the crseek library facilitates this208
analysis with only three lines of CRISPR-specific code.209
6/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Task 3210
The crseek module is flexible enough to facilitate the analysis of novel experimental designs. Popula-211
tions of mixed HIV-1 LTR-containing plasmids were subjected to an in vitro cutting assay, as described212
above, in an effort to explore the most appropriate way of predicting SpCas9 effectiveness against a mixed213
population of DNA fragments. This mirrors the biological situation found in HIV-1-infected individu-214
als who often contain a swarm of related, yet genetically distinct, integrated viral genomes (Dampier215
et al., 2017). The script below describes how to compare the predictions of three popular methods with216
experimentally determined cutting profiles.217
# Load in the known data
cutting_data = pd.read_csv('data/IVCA_gRNAs_efficiency.csv')
# Text objects need to be converted to Bio.Seq
# objects with the correct Alphabet
ct = utils.smrt_seq_convert('Seq',
cutting_data['Target'].values,
alphabet=Alphabet.generic_dna)
cutting_data['target'] = list(ct)
sd = utils.smrt_seq_convert('Seq',
cutting_data['gRNA Sequence'].values,
alphabet=Alphabet.generic_dna)
cutting_data['spacer'] = [s.transcribe() for s in sd]
# Predict the score using each model across all sequences
ests = [('MIT', estimators.MITEstimator.build_pipeline()),
('CFD', estimators.CFDEstimator.build_pipeline()),
('Kinetic', estimators.KineticEstimator.build_pipeline())]
for name, est in ests:
# Models only require the gRNA and Target sequence columns
data = cutting_data[['spacer', 'target']].values
cutting_data[name] = est.predict_proba(data)
# Aggregate the predicted cutting data across each sample
predicted_cleavage = pd.pivot_table(cutting_data,
index = 'Patient',
columns = 'gRNA Name',
values = ['MIT', 'CFD',
'Kinetic'],
aggfunc = 'mean')
After aggregation, these results can be plotted against the known results as shown in Fig. 3. In these218
studies, the CFD model has the best correlation between the predicted and observed results (r2 = 0.90)219
when compared to an r2 = 0.81 for the MIT model and r2 = 0.83 for the Kinetic model.220
Task 4221
The extensible nature of the crseek module allows simple integration with the scikit-learn archi-222
tecture. The crseek.preprocessing transformers can be used to easily extract features from223
target, spacer pairs that can be used as inputs for scikit-learn prediction models. Addition-224
ally, the crseek.estimators subclass the sklearn.BaseEstimator and implement .fit,225
.predict, and .predict proba methods allowing them to be used interchangeably with native226
scikit-learn models. The script below shows how crseek tools can be used to develop new machine227
learning methods for identifying interaction rules as well as comparing the results to currently published228
methodologies.229
7/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
files = sorted(glob.glob('data/GUIDESeq/*/*.tsv'))
dfs = [pd.read_csv(f, sep='\t').reset_index() for f in files]
hit_data = pd.concat(dfs,
axis=0, ignore_index=True)
# Multiple data loading/cleaning steps have been omitted.
# See the Github repository files for the entire script.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold
# Convert spacer-target pairs into a binary "matching" vector
transformer = preprocessing.MatchingTransformer()
X = transformer.transform(hit_data[['spacer', 'target']].values)
y = (hit_data['NormSum'] >= cutoff).values
# Evaluate the module using 3-fold cross-validation
grad_res = cross_validate(GradientBoostingClassifier(), X, y,
scoring = ['accuracy',
'precision',
'recall'],
cv = StratifiedKFold(random_state=0),
return_train_score=True)
grad_res = pd.DataFrame(grad_res)
# Evaluate MITEstimator using 3-fold cross-validation
mit_res = cross_validate(estimators.MITEstimator(), X, y,
scoring = ['accuracy',
'precision',
'recall'],
cv = StratifiedKFold(random_state=0),
return_train_score=True)
mit_res = pd.DataFrame(mit_res)
The prediction models that are part of the crseek module can be directly used in the scikit-learn230
evaluation functions like cross validate. Additionally, the preprocessing tools can be used231
to convert (spacer, target) pairs into a feature space that is directly usable by other scikit-learn232
models. Fig. 4 shows the results of exploring simple, built-in, scikit-learn models utilizing data from a233
GUIDE-Seq experiment by Tsai et al. (2015). While these results are preliminary, there is evidence that234
more advanced machine learning methods such as gradient boosting and feed-forward neural networks235
may have improved prediction capabilities.236
Discussion237
As the CRISPR-Cas gene-editing field expands, the number of potential strategies will increase as well.238
It is not feasible to build single purpose-driven tools to account for all of these potential strategies. For239
example, nickase-based strategies require finding targets with potential spacers on each strand at the same240
position (Shen et al., 2014), while employing homologous recombination gene editing strategies requires241
finding targets that are a specific distance away (Chen et al., 2013). While it may be easy to develop242
single instances of these strategies by hand, it would be impracticable to do so across many, potentially243
thousands, of genes. As such, a modular architecture was employed to allow bioinformaticians to use244
individual parts of the library as needed to leverage the power of Python to design gene-editing strategies.245
The crseek module can also be used to explore the use of CRISPR-Cas gene-editing in non-model246
organisms. Even ”draft quality” genome sequences containing degenerate bases can be searched. This247
further includes the use of differing CRISPR-Cas systems that can be explored easily. Researchers can248
also integrate crseek into the scikit-learn ecosystem to develop more advanced estimators as more data249
8/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
becomes available.250
The crseek module can be used to evaluate the effectiveness of CRISPR-Cas gene in rapidly251
evolving bleeding-edge use-cases of the CRISPR-Cas system. This will be an invaluable tool for fields252
editing highly mutable genomes and meta-genomes of mixed populations. In the case of HIV-1, recent253
research has shown that inter-patient variability presents a barrier to effective targeting (Dampier et al.,254
2017; Roychoudhury et al., 2018). However, utilizing this tool it was possible to exploit the promiscuity255
of SpCas9 to construct a set spacers capable of targeting the wide range of known mutations (Dampier256
et al., 2018). Additionally, CRISPR gene-editing has been proposed as a strategy for shifting the human257
microbiome by selectively targeting specific species (Bikard and Barrangou, 2017; Gomaa et al., 2014).258
The crseekmodule can measure the effectiveness across each composed metagenome and accommodate259
arbitrarily complicated targeting rules (eg. target genomes A-E, miss genomes F-J).260
Lastly, the crseek module will be useful to researchers exploring the binding rules of non-model261
Cas9 variants. Task 4 shows the ease of training a new estimator given a set of binding data. Using a Cas9262
variant in a genome-wide binding experiment such as GUIDE-Seq (Tsai et al., 2015), CIRCLE-Seq (Tsai263
et al., 2017), or SITE-Seq (Cameron et al., 2017), can provide invaluable training data. The crseek264
module can be easily merged with the scikit-learn architecture or even more advanced deep learning265
frameworks such as Keras (Chollet et al., 2015) or Tensorflow (Abadi et al., 2015).266
Conclusion267
By designing a consistent API that mirrors other common tools in the Python scientific stack, crseek268
can be used to automate arbitrarily complicated gene editing strategies with minimal effort. Along with269
the multitude of implemented features, crseek is an invaluable resource for computational biologists270
exploring the intricacies of CRISPR-Cas gene editing.271
ACKNOWLEDGMENTS272
We would like to acknowledge the patients of the Drexel Medicine CARES cohort for providing material273
for this study. Patients in the Drexel CNS AIDS Research and Eradication Study (CARES) Cohort were274
recruited from the Partnership Comprehensive Care Practice of the Division of Infectious Disease and275
HIV Medicine at Drexel University College of Medicine (Drexel Medicine), located in Philadelphia,276
Pennsylvania. Patients in the Drexel Medicine CARES cohort were recruited under protocol 1201000748277
(Brian Wigdahl, PI), which adheres to the ethical standards of the Helsinki Declaration (1964, amended278
most recently in 2008). All patients provided written consent upon enrollment.279
REFERENCES280
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean,281
J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R.,282
Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster,283
M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas,284
F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow:285
Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.286
Ablain, J., Durand, E. M., Yang, S., Zhou, Y., and Zon, L. I. (2015). A CRISPR/Cas9 vector system for287
tissue-specific gene disruption in zebrafish. Developmental cell, 32(6):756–64.288
Anaconda (2016). Anaconda software distribution.289
Antell, G., Dampier, W., Aiamkitsumrit, B., Nonnemacher, M., Jacobson, J., Pirrone, V., Zhong, W.,290
Kercher, K., Passic, S., Williams, J., Schwartz, G., Hershberg, U., Krebs, F., and Wigdahl, B. (2016).291
Utilization of HIV-1 envelope V3 to identify X4- and R5-specific Tat and LTR sequence signatures.292
Retrovirology, 13(1).293
Bae, S., Park, J., and Kim, J.-S. (2014). Cas-OFFinder: a fast and versatile algorithm that searches294
for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics (Oxford, England),295
30(10):1473–5.296
Barrangou, R. (2015). The roles of CRISPR–Cas systems in adaptive immunity and beyond. Current297
Opinion in Immunology, 32:36–41.298
Bikard, D. and Barrangou, R. (2017). Using CRISPR-Cas systems as antimicrobials. Current Opinion in299
Microbiology, 37:155–160.300
9/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer,301
P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G. (2013).302
API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD303
Workshop: Languages for Data Mining and Machine Learning, pages 108–122.304
Cameron, P., Cameron, P., Settle, A. H., Fuller, C. K., Thompson, M. S., Cigan, A. M., Young, J. K., and305
May, A. P. (2017). SITE-Seq: A Genome-wide Method to Measure Cas9 Cleavage. Protocol Exchange.306
Cao, J., Wu, L., Zhang, S.-M., Lu, M., Cheung, W. K. C., Cai, W., Gale, M., Xu, Q., and Yan, Q. (2016).307
An easy and efficient inducible CRISPR/Cas9 platform with improved specificity for multiple gene308
targeting. Nucleic acids research, 44(19):e149.309
Chen, C., Fenk, L. A., and de Bono, M. (2013). Efficient genome editing in Caenorhabditis elegans by310
CRISPR-targeted homologous recombination. Nucleic Acids Research, 41(20):e193–e193.311
Chollet, F. et al. (2015). Keras.312
Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck,313
T., Kauff, F., Wilczynski, B., and de Hoon, M. J. L. (2009). Biopython: freely available Python314
tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England),315
25(11):1422–3.316
Cong, L., Ran, F. A., Cox, D., Lin, S., Barretto, R., Habib, N., Hsu, P. D., Wu, X., Jiang, W., Marraffini,317
L. A., and Zhang, F. (2013). Multiplex genome engineering using CRISPR/Cas systems. Science (New318
York, N.Y.), 339(6121):819–23.319
Cui, Y., Xu, J., Cheng, M., Liao, X., and Peng, S. (2018). Review of CRISPR/Cas9 sgRNA Design Tools.320
Interdisciplinary Sciences: Computational Life Sciences, 10(2):455–465.321
Dampier, W., Sullivan, N., Chung, C.-H., Mell, J., Nonnemacher, M., and Wigdahl, B. (2017). Designing322
broad-spectrum anti-HIV-1 gRNAs to target patient-derived variants. Scientific Reports, 7(1).323
Dampier, W., Sullivan, N. T., Mell, J., Pirrone, V., Ehrlich, G., Chung, C.-H., Allen, A. G., DeSimone,324
M., Zhong, W., Kercher, K., Passic, S., Williams, J., Szep, Z., Khalili, K., Jacobson, J., Nonnemacher,325
M. R., and Wigdahl, B. (2018). Broad spectrum and personalized gRNAs for CRISPR/Cas9 HIV-1326
therapeutics. AIDS Research and Human Retroviruses, page AID.2017.0274.327
Davidson, M. (2016). megfp-pbad.328
Ding, Y., Li, H., Chen, L.-L., and Xie, K. (2016). Recent Advances in Genome Editing Using CRISPR/-329
Cas9. Frontiers in Plant Science, 7:703.330
Doench, J. G., Fusi, N., Sullender, M., Hegde, M., Vaimberg, E. W., Donovan, K. F., Smith, I., Tothova,331
Z., Wilen, C., Orchard, R., Virgin, H. W., Listgarten, J., and Root, D. E. (2016). Optimized sgRNA332
design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology,333
34(2):184–191.334
Doudna, J. A. and Charpentier, E. (2014). The new frontier of genome engineering with CRISPR-Cas9.335
Fu, Y., Foden, J. A., Khayter, C., Maeder, M. L., Reyon, D., Joung, J. K., and Sander, J. D. (2013).336
High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature337
Biotechnology, 31(9):822–826.338
Genscript (2016). CRISPR Handbook. GenScript, (October):1–32.339
Gomaa, A. A., Klumpe, H. E., Luo, M. L., Selle, K., Barrangou, R., and Beisel, C. L. (2014). Pro-340
grammable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. mBio,341
5(1):00928–13.342
Hsu, P. D., Scott, D. A., Weinstein, J. A., Ran, F. A., Konermann, S., Agarwala, V., Li, Y., Fine, E. J.,343
Wu, X., Shalem, O., Cradick, T. J., Marraffini, L. A., Bao, G., and Zhang, F. (2013). DNA targeting344
specificity of RNA-guided Cas9 nucleases. Nature Biotechnology, 31(9):827–832.345
Hu, W., Kaminski, R., Yang, F., Zhang, Y., Cosentino, L., Li, F., Luo, B., Alvarez-Carbonell, D., Garcia-346
Mesa, Y., Karn, J., Mo, X., and Khalili, K. (2014). RNA-directed gene editing specifically eradicates347
latent and prevents new HIV-1 infection. Proceedings of the National Academy of Sciences of the348
United States of America, 111(31):11461–6.349
Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J. A., and Charpentier, E. (2012). A pro-350
grammable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science (New York,351
N.Y.), 337(6096):816–21.352
Kim, Y. B., Komor, A. C., Levy, J. M., Packer, M. S., Zhao, K. T., and Liu, D. R. (2017). Increasing the353
genome-targeting scope and precision of base editing with engineered Cas9-cytidine deaminase fusions.354
Nature biotechnology, 35(4):371–376.355
10/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Klein, M., Eslami-Mossallam, B., Arroyo, D. G., and Depken, M. (2018). Hybridization Kinetics Explains356
CRISPR-Cas Off-Targeting Rules. Cell Reports, 22(6):1413–1423.357
Li, L., Aiamkitsumrit, B., Pirrone, V., Nonnemacher, M. R., Wojno, A., Passic, S., Flaig, K., Kilareski, E.,358
Blakey, B., Ku, J., Parikh, N., Shah, R., Martin-Garcia, J., Moldover, B., Servance, L., Downie, D.,359
Lewis, S., Jacobson, J. M., Kolson, D., and Wigdahl, B. (2011). Development of co-selected single360
nucleotide polymorphisms in the viral promoter precedes the onset of human immunodeficiency virus361
type 1-associated neurocognitive impairment. Journal of NeuroVirology, 17(1):92–109.362
Li, L., Parikh, N., Liu, Y., Kercher, K., Dampier, W., Flaig, K., Aiamkitsumrit, B., Passic, S., Frantz,363
B., Blakey, B., Pirrone, V., Moldover, B., Jacobson, J., Feng, R., Nonnemacher, M., and Wigdahl,364
B. (2015). Impact of Naturally Occurring Genetic Variation in the HIV-1 LTR TAR Region and Sp365
Binding Sites on Tat-Mediated Transcription. Journal of Human Virology & Retrovirology, 2(4).366
Marraffini, L. A. and Sontheimer, E. J. (2010). CRISPR interference: RNA-directed adaptive immunity in367
bacteria and archaea. Nature Reviews Genetics, 11(3):181–190.368
Monot, M., Boursaux-Eude, C., Thibonnier, M., Vallenet, D., Moszer, I., Medigue, C., Martin-Verstraete,369
I., and Dupuy, B. (2011). Reannotation of the genome sequence of Clostridium difficile strain 630.370
Journal of Medical Microbiology, 60(8):1193–1199.371
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer,372
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and373
Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning374
Research, 12:2825–2830.375
Roychoudhury, P., De Silva Feelixge, H., Reeves, D., Mayer, B. T., Stone, D., Schiffer, J. T., and Jerome,376
K. R. (2018). Viral diversity is an obligate consideration in CRISPR/Cas9 designs for targeting the377
HIV reservoir. BMC Biology, 16(1):75.378
Shen, B., Zhang, W., Zhang, J., Zhou, J., Wang, J., Chen, L., Wang, L., Hodgkins, A., Iyer, V., Huang, X.,379
and Skarnes, W. C. (2014). Efficient genome modification by CRISPR-Cas9 nickase with minimal380
off-target effects. Nature Methods, 11(4):399–402.381
Sternberg, S. H. and Doudna, J. A. (2015). Expanding the Biologist’s Toolkit with CRISPR-Cas9.382
Tsai, S. Q., Nguyen, N. T., Malagon-Lopez, J., Topkar, V. V., Aryee, M. J., and Joung, J. K. (2017).383
CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR-Cas9 nuclease off-targets.384
Nature methods, 14(6):607–614.385
Tsai, S. Q., Zheng, Z., Nguyen, N. T., Liebers, M., Topkar, V. V., Thapar, V., Wyvekens, N., Khayter,386
C., Iafrate, A. J., Le, L. P., Aryee, M. J., and Joung, J. K. (2015). GUIDE-seq enables genome-wide387
profiling of off-target cleavage by CRISPR-Cas nucleases. Nature biotechnology, 33(2):187–197.388
Van Rossum, G., Warsaw, B., and N., C. (2001). Pep 8 – style guide for python code.389
Wu, X., Scott, D. A., Kriz, A. J., Chiu, A. C., Hsu, P. D., Dadon, D. B., Cheng, A. W., Trevino, A. E.,390
Konermann, S., Chen, S., Jaenisch, R., Zhang, F., and Sharp, P. A. (2014). Genome-wide binding of391
the CRISPR endonuclease Cas9 in mammalian cells. Nature biotechnology, 32(7):670–6.392
FIGURES & TABLES393
11/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Patient Clone Target Cleavage Efficiency
R5 1 ATCAGATATCCACTGACCTT TGG 73%
X4 1 ACCAGATATCCACTGTGCTT TGG 6%
A0019
1 ACAAGATTTCCACTGACCTT TGG
56%2 ACAAGATTTCCACTGACCTT TGG
A0020
1 ACCAGATTTCCCCTGACCTT TGG
25%2 ACCAGATTTCCCCTGACCTT TGG
A0049
1 GTCAGATACCCACTGACCTT TGG
91%2 GTCAGATACCCACTGACCTT TGG
3 GTCAGATATCCACTGACCTT TGG
A0050 1 ACCAGGTTTCCACTGACCTT TGG 58%
A0107
1 GTCAGATATCCACTGACCTT TGG
84%
2 GTCAGATATCCACTGACCTT TGG
3 GTCAGATATCCACTGACCTT TGG
4 GTCAGATATCCACTGACCTT TGG
5 ACTAGATGGCCACTGACCTT TGG
6 GTCAGATATCCACTGACCTT TGG
Table 1. Sequences used in the in vitro cutting assay. The Target column refers to the target region of
the cloned sequence with the highest match to the LTR-A gRNA used for cleavage
(ATCAGATATCCACTGACCTT NGG) in the LTR (loci). The pam region is bolded. The Cleavage
Efficiency refers to the percentage of the plasmids that were cleaved as measured by the BioAnalyzer.
12/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Amp-R
ori
EGFP
araC
T7 Tag
Figure 1. A plasmid map of mEGFP-pBAD with the gene and promoter features labeled in blue and
light blue. Potential spacer-targeting sites are shown in green.
13/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Designable genes
Undesignable genes
Spacers
Figure 2. Approximately 93% of genes in the C. difficile genome can be targeted by at least 5 spacers
without predicted off-target effects. Gene regions are shown along the inner track with those in blue
representing those for which at least five spacers could be designed while those in red could not be
properly targeted. The outer track shows the location of spacers in green.
14/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Observed Cleavage 0 0.25 0.50 0.75 1.0
Observed Cleavage 0 0.25 0.50 0.75 1.0
Observed Cleavage 0 0.25 0.50 0.75 1.0
Pre
dic
ted
Cle
ava
ge
0
0.2
0.4
0.6
0.8
1.0
A B C MIT CFD Kinetic
r2=0.81 r2=0.90 r2=0.83
Figure 3. Prediction models have different levels of agreement with in vitro data. Three different
prediction models were used to predict the cleavage percentage of a mixed population of targets. Each
point represents the percentage of cleavage observed on a population of distinct genetic variants
compared to the predicted percentage of cleavage. The line indicates the regression line, while the
shadows indicate the 95% confidence intervals of 1000 bootstrap replicates. A) The predictions were
performed using the algorithm proposed by Hsu et al. (2013). B) The predictions were performed using
the algorithm proposed by Doench et al. (2016) C) The predictions were performed using the algorithm
proposed by Klein et al. (2018).
15/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
MIT
CF
D
Kin
etic
Gra
die
nt B
oo
stin
g
Ne
are
st N
eig
hb
ors
Ne
ura
l N
etw
ork
Me
an
Ab
so
lute
Err
or
0
0.05
0.10
0.15
0.20
0.25
Figure 4. Advanced machine learning techniques may be useful in improving the agreement between
predicted and observed cleavage. Each bar indicates the mean absolute error between the predicted
cleavage rate and the observed cleavage rate measured using GUIDE-Seq by Tsai et al. (2015).
Scikit-learn models were fit using default parameters on the output of the
crseek.preprocessing.MisMatchEstimator.
16/16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27094v1 | CC BY 4.0 Open Access | rec: 6 Aug 2018, publ: 6 Aug 2018
Recommended