Machine learning for neural decoding - arXiv · We will describe a number of specific methods suitable for neural decoding later in this tutorial. Ultimately, it is often beneficial

Machinelearningforneuraldecoding

JoshuaI.Glaser1,2,6,8,9*,AriS.Benjamin6,RaeedH.Chowdhury3,4,MatthewG.Perich3,4,LeeE.Miller2-4,andKonradP.Kording2-7

1. InterdepartmentalNeuroscienceProgram,NorthwesternUniversity,Chicago,IL,USA2. DepartmentofPhysicalMedicineandRehabilitation,NorthwesternUniversityandShirleyRyanAbilityLab,Chicago,IL,USA3. DepartmentofPhysiology,NorthwesternUniversity,Chicago,IL,USA4. DepartmentofBiomedicalEngineering,NorthwesternUniversity,Chicago,IL,USA5. DepartmentofAppliedMathematics,NorthwesternUniversity,Chicago,IL,USA6. DepartmentofBioengineering,UniversityofPennsylvania,Philadelphia,IL,USA7. DepartmentofNeuroscience,UniversityofPennsylvania,Philadelphia,IL,USA8. DepartmentofStatistics,ColumbiaUniversity,NewYork,NY,USA9. ZuckermanMindBrainBehaviorInstitute,ColumbiaUniversity,NewYork,NY,USA*Contact:[email protected]:Despiterapidadvancesinmachinelearningtools,themajorityofneuraldecodingapproachesstillusetraditionalmethods.Modernmachinelearningtools,whichareversatileandeasytouse,havethe potential to significantly improve decoding performance. This tutorial describes how toeffectively apply these algorithms for typical decoding problems. We provide descriptions, bestpractices,andcodeforapplyingcommonmachinelearningmethods,includingneuralnetworksandgradientboosting.Wealsoprovidedetailedcomparisonsoftheperformanceofvariousmethodsatthe task of decoding spiking activity in motor cortex, somatosensory cortex, and hippocampus.Modernmethods,inparticularneuralnetworksandensembles,significantlyoutperformtraditionalapproaches, such as Wiener and Kalman filters. Improving the performance of neural decodingalgorithms allows neuroscientists to better understand the information contained in a neuralpopulation,andcanhelpadvanceengineeringapplicationssuchasbrainmachineinterfaces.IntroductionNeuraldecodingusesactivityrecordedfromthebraintomakepredictionsaboutvariables intheoutsideworld.Forexample,researcherspredictmovementsbasedonactivityinmotorcortex[1,2],decisionsbasedonactivityinprefrontalandparietalcortices[3,4],andspatiallocationsbasedonactivityinthehippocampus[5,6].Thesedecodingpredictionscanbeusedtocontroldevices(e.g.,aroboticlimb),ortobetterunderstandhowareasofthebrainrelatetotheoutsideworld.Decodingisacentraltoolinneuralengineeringandforneuraldataanalysis.In essence, neural decoding is a regression (or classification) problem relating neural signals toparticularvariables.Whenframingtheprobleminthisway,itisapparentthatthereisawiderangeofmethodsthatonecouldapply.However,despitetherecentadvances inmachine learning(ML)techniques for regression, it is still common to decode activitywith traditionalmethods such as

linear regression. Using modern ML tools for neural decoding has the potential to boostperformancesignificantly,andmightallowdeeperinsightsintoneuralfunction.This tutorial is designed to help readers start applying standardMLmethods for decoding.Wedescribewhenoneshould(orshouldnot)useMLfordecoding,howtochooseanMLmethod,andbestpractices suchas cross-validationandhyperparameteroptimization.Weprovidecompanioncode thatmakes itpossible to implementavarietyofdecodingmethodsquickly.Using thissamecode,wedemonstrateherethatMLmethodsoutperformtraditionaldecodingmethods.Inexampledatasets of recordings from monkey motor cortex, monkey sensorimotor cortex, and rathippocampus,modernMLmethods showed the highest accuracy decoding of availablemethods.Using our code and this tutorial, readers can achieve these performance improvements on theirowndata.Whyusemachinelearningfordecoding?Ingeneral,modernmachine learninghasbeenshown to improvepredictiveperformancegreatly[7-9]. Based on this evidence of improved predictions (which we will demonstrate later in thistutorial inAdemonstration and comparison ofmachine learning for decoding), there aremultiplereasonstouseMLtodecodeneuralactivity[10].EngineeringapplicationsDecodingisoftenusedinengineeringcontexts,suchasforbrainmachineinterfaces(BMIs),wheresignalsfrommotorcortexareusedtocontrolcomputercursors[1],roboticarms[11],andmuscles[2].Whentheprimaryaimoftheseengineeringapplicationsistoimprovepredictiveaccuracy,MLshouldgenerallybebeneficial1.UnderstandingwhatinformationiscontainedinneuralactivityDecodingisalsoanimportanttoolforunderstandinghowneuralsignalsrelatetotheoutsideworld.It can be used to determine how much information neural activity contains about an externalvariable(e.g.,sensationormovement)[12-14],andhowthisinformationdiffersacrossbrainareas[15-17],experimental conditions [18,19], anddiseasestates [20].When thegoal is todeterminehowmuchinformationaneuralpopulationhasaboutanexternalvariable,regardlessoftheformofthatinformation,thenusingMLwillbebeneficial.While modernMLmethods can provide an increase in decoding accuracy, it is important to becareful with the scientific interpretation of decoding results. Decoding can tell us how muchinformationaneuralpopulationhasaboutavariableX.However,highdecodingaccuracydoesnotmeanthatabrainareaisdirectlyinvolvedinprocessingX,orthatXisthepurposeofthebrainarea.Forexample,withapowerfuldecoder, itcouldbepossibletoaccuratelyclassify imagesbasedon

1 Oneproblemthatisoftenhypothesizedforonlineapplications,butwhosemagnitudeisunclear,isthatcomplicatedmachinelearningalgorithmscouldmakeitharderforuserstolearntousetheirdevice.

recordings from the retina, since the retinahas informationaboutall visual space.However, thisdoesnotmeanthattheprimarypurposeoftheretinaisimageclassification.Moreover,eveniftheneural signal temporally precedes the external variable, it does not necessarily mean that it iscausally involved.Forexample,movement-related informationcouldreachsomatosensorycortexpriortomovementduetoanefferencecopyfrommotorcortex,ratherthansomatosensorycortexbeingresponsibleformovementgeneration.Thus,researchersshouldconstraininterpretationstomerely address the information in neural populations about variables of interest, and how thisinformationmayvaryacrossbrainregions,experimentalconditions,ortime.BenchmarkingforsimplerdecodingmodelsDecoders can also be used to understand the form of themapping between neural activity andvariablesintheoutsideworld[21,22].Thatis,ifresearchersaimtotestifthemappingfromneuralactivity to behavior/stimuli (the “neural code”) has a certain structure, they can develop a“hypothesis-driven decoder”with a specific form. If that decoder can predict task variableswithsomearbitrary accuracy level, this is sometimesheld as evidence that informationwithinneuralactivity indeed has the hypothesized structure. However, it is important to know how well ahypothesis-drivendecoderperformsrelativetowhatispossible.ThisiswheremodernMLmethodscan be of use. If a method designed to test a hypothesis decodes activity much worse thanMLmethods, then a researcher knows that their hypothesis likelymisses key aspects of the neuralcode.Hypothesis-drivendecoders should thus always be compared against a good-faith effort tomaximizeperformanceaccuracywithagoodmachinelearningapproach.WhatdecodershouldIusetoimprovepredictiveperformance?Dependingontherecordingmethod,location,andvariablesofinterest,differentdecodingmethodsmay be more effective. Neural networks, gradient boosted trees, support vector machines, andlinearmethods are among thedozens of potential candidates. Eachmakesdifferent assumptionsabout how inputs relate to outputs.Wewill describe a number of specificmethods suitable forneural decoding later in this tutorial. Ultimately, it is often beneficial to test multiple methods,perhapsstartingwiththemethodswehavefoundtoworkbestforourdemonstrationdatasets.DifferentmethodsmakedifferentimplicitassumptionsaboutthedataThe mapping between neural activity and task variables is likely to consistently have certainproperties. Forexample, themapping is likely smoothwith respect toactivity,meaning that tinychangesinpopulationfiringratesareassociatedwithtinychangesinthetaskvariables.Asanotherexample,taskvariablesmaychangeslowlyintime.Ingeneral,whenchoosingamethodwithwhichtodecodeneuralactivity, it is importanttochooseamethodthathasassumptionsthatmatchthepropertiesofthedata.Thestrongertheassumptionsthatcanbemade,thebetterthedecoderwillperform—aslongasthoseassumptionsarecorrect.There is no such thing as a method that makes no assumptions. This idea derives from a keytheoremintheMLliteraturecalledthe“NoFreeLunch”theorem,whichessentiallystatesthatno

algorithmwilloutperformallothersoneveryproblem[23].Thefactthatsomealgorithmsperformbetter than others in practicemeans that their assumptions about the data are better. Even themostcomplexneuralnetworksmakeassumptionsabouthowinputsrelatetooutputs,thoughtheseassumptionsareweakerthantheassumptionsmadebylinearregression(inthesensethatneuralnetworks can also express nonlinear mappings). The knowledge that all methods makeassumptionstovaryingdegreescanhelponebeintentionalaboutchoosingadecoder.Forsomeneuraldecodingproblems,theoutputsareknowntohavesomepatternorstructure,andmethodsexist thatallowthisknowledgetobe incorporatedexplicitly.Forexample, letussayweare estimating the kinematics of a movement from neural activity. The Kalman filter, which isfrequentlyusedformovementdecoding[24-26],usesadditionalinformationabouthowmovementkinematics transition over time. There are also Bayesian decoders that can incorporate priorbeliefsaboutthedecodedvariables.Forexample,theNaiveBayesdecoder,whichisfrequentlyusedforpositiondecodingfromhippocampalactivity[5,27,28],cantakeintoaccountpriorinformationabout the probability distribution of positions. Decoders with task knowledge built in areconstrainedintermsoftheirsolutions,whichcanbehelpful,providedthisknowledgeiscorrect.Itishardertoknowmuchaboutthestructureofneuralactivity.Howneuralactivitycodesfortheoutside world (the “neural code”) is usually (if not always) unknown, and for this reason it isimportant to choose decoding algorithms that can express a wide array of input/outputrelationships. For example, if neural activity relates nonlinearly or nonmonotonically to the taskvariable, a method that can express nonlinear functions will outperform a linear method like aKalmanfilter. Ingeneral,agooddecoderwill incorporatepriorknowledgeabouttheoutputs,butstill be nonlinear and relatively agnostic about the overall input/output relationship. It is nearlyalwaysbetter tomakeweakerassumptionsabout thedatathantomakestrongassumptionsthatarewrong.Theperilsofchoosingamodelthatistooexpressiveorcomplexarevisibleinthephenomenonof“overfitting”, inwhich themodel learns components of thenoise that are unique to thedata themodel is trained on, but which do not generalize to any independent test data. To combatoverfitting, one can choose a simpler algorithm that is less likely to learn tomodel this noise. Acommon intuition is that the number of free parameters in the algorithm (standing in for themethod’s expressivity) should not exceed the number of training points. Alternatively, one canapplyregularization,whichessentiallypenalizesthecomplexityofamodel.Thiseffectivelyreducestheexpressivityofamodel,whilestillallowingittohavemanyfreeparameters[29].Regularizationisanimportantwaytoincludetheassumptionthattheinput/outputrelationshipisnotarbitrarilycomplex.Maximumlikelihoodestimatesvs.posteriordistributionsDifferentclassesofmethodsalsoprovidedifferenttypesofestimates.Typically,machinelearningalgorithmsprovidemaximumlikelihoodestimatesofthedecodedvariable.Thatis,thereisasinglepointestimateforthevalueofthedecodedvariable,whichistheestimatethatismostlikelytobe

true. Unlike the typical use of machine learning algorithms2, Bayesian decoding provides aprobabilitydistributionoverallpossibilitiesforthedecodedoutputs(the“posterior”distribution),thus also providing information about the uncertainty of the estimate. Themaximum likelihoodestimateisthepeakoftheposteriordistribution.EnsemblemethodsIt is important to note that the user does not need to choose a single model. One can usuallyimproveon theperformanceofanysinglemodelbycombiningmultiplemodels, creatingwhat iscalledan“ensemble”.Ensemblescanbesimpleaveragesoftheoutputsofmanymodels,ortheycanbe more complex combinations. Weighted averages are common, which one can imagine as asecond-tier linear model that takes the outputs of the first-tier models as inputs (sometimesreferred to as a super learner). In principle, one can use anymethod as the second-tier model.Third-andfourth-tiermodelsareuncommonbutimaginable.ManysuccessfulMLmethodsare,attheir core, ensembles [30]. InML competition sites like Kaggle [31],mostwinning solutions arecomplicatedensembles.UnderwhichconditionswillMLtechniquesimprovedecodingperformance?WhetherMLwilloutperformothermethodsdependsonmanyfactors.Theformoftheunderlyingneuralcode,thelengthoftherecordingsession,andthelevelofnoisewillallaffectthepredictiveaccuracyofdifferentmethodstodifferentdegrees.There isnowaytoknowaheadof timewhichmethodwillperformthebest,andwerecommendcreatingapipelinetoquicklytestandcomparemany ML and simpler regression methods. For typical decoding situations, however, we expectcertainMLmethodstoperformbetterthansimplermethods.Inthedemonstrationbelow,weshowthat this is true across three datasets and a wide range of variables like training data length,numberofneurons,binsize,andhyperparameters.Thereasonisthat,asacollectionofmethods,MLmakes fewerassumptions about the formof thedata.Many traditionalmethodsassume thatneuraldatarelates linearly to theoutputs, forexample,or throughaspecificnonlinearitychosenbeforehand.Machine learning, on theotherhand, can learnhowoutputs relate toneural activityeveniftherelationshipisquitenonlinear,andthusoutperformothermethods.ApracticalguideforusingmachinelearningfordecodingInanydecodingproblem,onehasneuralactivityfrommultiplesourcesthatisrecordedforaperiodoftime.Whilewefocushereonspikingneurons,thesamemethodscouldbeusedwithotherformsofneuraldata,suchastheBOLDsignalinfMRI,orthepowerinparticularfrequencybandsoflocalfield potential (LFP) or electroencephalography (EEG) signals.Whenwe decode from the neuraldata,whateverthesource,wewouldliketopredictthevaluesofrecordedoutputs(Fig.1a).Dataformatting/preprocessingfordecoding

2 While not commonly practiced, it is possible for modern ML algorithms to also estimate theuncertainyofthedecodedvariables.Forexample,aneuralnetworkcanlearntoestimateavariablealongwithitsownuncertainty.

Preparingforregressionvs.classificationDepending on the task, the desired output can be variables that are continuous (e.g., velocity orposition),ordiscrete(e.g.,choices). Inthefirstcasethedecoderwillperformregression,whileinthesecondcaseitwillperformclassification.InourPythonpackage,decoderclassesarelabeledtoreflectthisdivision.Inthedataprocessingstep,takenotewhetherapredictionisdesiredcontinuouslyintime,oronlyat the end of each trial. In this tutorial, we focus on situations in which a prediction is desiredcontinuouslyintime.However,manyclassificationsituationsrequireonlyonepredictionpertrial(e.g.,whenmakingasinglechoicepertrial).Ifthisisthecase,thedatamustbepreparedsuchthatmanytimepointsinasingletrialaremappedtoasingleoutput.TimebinningdividescontinuousdataintodiscretechunksAnumberofimportantdecisionsarisefromthebasicproblemthattimeiscontinuouslyrecorded,but decoding methods generally require discrete data for their inputs (and, in the continual-predictionsituation,theiroutputs).Atypicalsolutionistodivideboththeinputsandoutputsintodiscrete timebins.Thesebinsusually contain the average input or output over a small chunkoftime,butcouldalsocontaintheminimum,maximum,ortheregularsamplingofanyinterpolationmethodfittothedata.When predictions are desired continuously in time, one needs to decide upon the temporalresolution,R, fordecoding.Thatis,dowewanttomakeapredictionevery50ms,100ms,etc?Weneed to put the input and output into bins of length R (Fig. 1a). It is common (although notnecessary)tousethesamebinsizefortheneuraldataandoutputdata,andwedosohere.Thus,ifTis the lengthof therecording,wewillhaveapproximatelyT/R totaldatapointsofneuralactivityandoutputs.Next,weneed to choose the timeperiodofneural activityused topredict a givenoutput. In thesimplestcase,theactivityfromallneuronsinagiventimebinwouldbeusedtopredicttheoutputin that same timebin.However, it is often the case thatwewant theneuraldata toprecede theoutput (e.g., in the case ofmakingmovements) or follow the decoder output (e.g., in the case ofinferringthecauseofasensation).Plus,weoftenwanttouseneuraldatafrommorethanonebin(e.g.,using500msofprecedingneuraldatatopredictamovementinthecurrent50msbin).Inthefollowing,weusethenomenclaturethatBtimebinsofneuralactivityarebeingusedtopredictagiven output. For example, ifwe use one bin preceding the output, one concurrent bin, and onefollowingbin,thenB=3(Fig.1a).Notethatwhenmultiplebinsofneuraldataareusedtopredictanoutput(B>1),thenoverlappingneuraldatawillbeusedtopredictdifferentoutputtimes(Fig.1a),makingthemdependent.Whenmultiplebinsofneuraldataareusedtopredictanoutput,thenwewillneedtoexcludesomeoutput bins. For instance, if we are using one bin of neural data preceding the output, thenwecannotpredictthefirstoutputbin,andifweareusingonebinofneuraldatafollowingtheoutput,

thenwecannotpredictthefinaloutputbin(Fig.1a).Thus,wewillbepredictingKtotaloutputbins,whereKislessthanthetotalnumberofbins(T/R).Tosummarize,ourdecoderswillbepredictingeachoftheseKoutputsusingBsurroundingbinsofactivityfromNneurons.TheformatoftheinputdatadependsontheformofthedecoderNon-recurrentdecoders:Formost standard regressionmethods, thedecoderhasnopersistentinternalstateormemory.InthiscaseNxBfeatures(thefiringratesofeachneuronineachrelevanttime bin) are used to predict each output (Fig. 1b). The inputmatrix of covariates,X, hasN xBcolumns(oneforeachfeature)andKrows(correspondingtoeachoutputbeingpredicted).Ifthereisasingleoutputthatisbeingpredicted,itcanbeputinavector,Y,oflengthK.Notethatformanydecoders,iftherearemultipleoutputs,eachisindependentlydecoded.Ifmultipleoutputsarebeingsimultaneouslypredicted,whichcanoccurwithneuralnetworkdecoders,theoutputscanbeputinamatrixY,thathasKrowsanddcolumns,wheredisthenumberofoutputsbeingpredicted.Sincethis is the format of a standard regression problem, many regression methods can easily besubstitutedforoneanotheroncethedatahasbeenpreparedinthisway.Recurrent neural network decoders: When using recurrent neural networks (RNNs) fordecoding, we need to put the inputs in a different format. Recurrent neural networks explicitlymodel temporal transitions across timewith apersistent internal state, called the “hidden state”(Fig.1c).Ateach timeof theB timebins, thehiddenstate isadjustedasa functionofboth theNfeatures(thefiringratesofallneuronsinthattimebin)andthehiddenstateattheprevioustimebin (Fig. 1c). After transitioning through all B bins, the hidden state in this final bin is used topredict theoutput. In thisway anRNNdecoder can integrate the effect of neural inputs over anextended period of time. For use in this type of decoder, the input can be formatted as a 3-dimensionaltensorofsizeKxNxB(Fig.1c).Thatis,foreachrow(correspondingtooneoftheKoutputbinstobepredicted),therewillbeNfeatures(2ndtensordimension)overBbins(3rdtensordimension)usedforprediction.Withinthis format,differenttypesofRNNs, includingthosemoresophisticatedthanthestandardRNNshowninFig.1c,canbeeasilyswitchedforoneanother.

Figure1:DecodingSchematica)Todecode(predict)theoutputinagiventimebin,weusedthefiringratesofallNneuronsinBtimebins.Inthisschematic,N=4andB=3(onebinprecedingtheoutput,oneconcurrentbin,andonefollowingbin).Here, we show a single output being predicted. b) For non-recurrent decoders (Wiener Filter, WienerCascade, Support Vector Regression, XGBoost, and Feedforward Neural Network in our subsequentdemonstration), this is a standardmachine learning regression problemwhereN xB features (the firingratesofeachneuron ineachrelevant timebin)areused topredict theoutput.c)Topredictoutputswithrecurrentdecoders(simplerecurrentneuralnetwork,GRUs,LSTMsinoursubsequentdemonstration)weused N features, with temporal connections across B bins. A schematic of a recurrent neural networkpredictingasingleoutputisontheright.

ApplyingmachinelearningThe typical process of applying ML to data involves testing several methods and seeing whichworksbest.Somemethodsmaybeexpectedtoworkbetterorworsedependingonthestructureof

thedata,andinthenextpartofthistutorialweprovideademonstrationofwhichmethodsworkbestfortypicalneuralspikingdatasets.Thatsaid,applyingMLisunavoidablyaniterativeprocess,andforthisreasonwebeginwiththeproperwaytochooseamongmanymethods.Itisparticularlyimportanttoconsiderhowtoavoidoverfittingourdataduringthisiterativeprocess.Comparemethodperformanceusingcross-validationGiven two (ormore)methods, theML practitionermust have a principledway to decidewhichmethodisbest.Itiscrucialtotestthedecoderperformanceonaseparate,held-outdatasetthatthedecoderdidnotseeduringtraining(Fig.2a).This isbecauseadecodertypicallywilloverfit to itstraining data. That is, itwill learn to predict outputs using idiosyncratic information about eachdatapointinthetrainingset,suchasnoise,ratherthantheaspectsthataregeneraltothedata.Anoverfit algorithm will describe the training data well but will not be able to provide goodpredictionsonnewdatasets.Forthisreason,thepropermetrictochoosebetweenmethodsistheperformance on held-out data, meaning that the available data should be split into separate“training”and“testing”datasets.Inpractice,thiswillreducetheamountoftrainingdataavailable.Inordertoefficientlyuseallofthedata,itiscommontoperformcross-validation(Fig.2b).In10-foldcross-validation,forexample,thedatasetissplitinto10sets.Thedecoderistrainedon9ofthesets,andperformanceistestedonthefinalset.Thisisdone10timesinarotatingfashion,sothateachsetistestedonce.Theperformanceonalltestsetsisgenerallyaveragedtogethertodeterminetheoverallperformance.Whenpublishingabouttheperformanceofamethod, it is importanttoshowtheperformanceonheld-out data thatwas additionally not used to select betweenmethods (Fig. 2c). Randomnoisemayleadsomemethodstoperformbetterthanothers,anditispossibletofooloneselfandothersby testingagreatnumberofmethodsandchoosing thebest. In fact,onecanentirelyoverfit toatestingdatasetsimplybyusingthetestperformancetoselectthebestmethod,evenifnomethodistrainedonthatdataexplicitly.Practitionersoftenmakethedistinctionbetweenthe“training”data,the “validation”data (which is used to select betweenmethods) and the “testing” data (which isused toevaluate the trueperformance).This three-waysplitofdata complicates cross-validationsomewhat(Fig.2b).Itispossibletofirstcreateaseparatetestingset,thenchooseamongmethodson theremainingdatausingcross-validation.Alternatively, formaximumdataefficiency,onecaniterativelyrotatewhichsplitisthetestingset,andthusperformatwo-tiernestedcross-validation.HyperparameteroptimizationWhen fitting any singlemethod, oneusuallyhas to additionally choose a set ofhyperparameters.These are parameters that relate to thedesign of thedecoder itself, and shouldnot be confusedwiththe‘normal’parametersthatarefitduringoptimization,e.g.,theweightsthatarefitinlinearregression. For example, neural networks can be designed to have any number of hidden units.Thus, theuserneeds toset thenumberofhiddenunits (thehyperparameter)before training thedecoder.Oftendecodershavemultiplehyperparameters,anddifferenthyperparametervaluescansometimes lead to greatly different performance. Thus, it is important to choose a decoder’shyperparameterscarefully.

Whenusingadecoderthathashyperparameters,oneshouldtakethefollowingsteps.First,alwayssplit thedata into threeseparatesets (trainingset, testingset,andvalidationset),perhapsusingnestedcross-validation(Fig.2b).Next,iteratethroughalargenumberofhyperparametersettingsand choose thebestbasedonvalidation setperformance. Simplemethods for searching throughhyperparametersareagridsearch(i.e.,samplingvaluesevenly)andrandomsearch[32].Therearealsomoreefficientmethods(e.g., [33,34]) thatcan intelligentlysearch throughhyperparametersbased on the performance of previously tested hyperparameters. The best performinghyperparameterandmethodcombination (on thevalidationset)willbe the finalmethod,unlessoneiscombiningmultiplemethodsintoanensemble.

Figure2.Schematicofcross-validationa)Aftertrainingadecoderonsomedata(green),wewouldliketoknowhowwellitperformsonheld-outdatathatwedonothaveaccessto(red).b)Left:Bysplittingthedatawehaveintotest(orange)andtrain(green)segments,wecanapproximatetheperformanceonheld-outdata.Ink-foldcross-validation,were-trainthedecoderktimes,andeachtimerotatewhichpartsofthedataarethetestortraindata.Theaveragetest set performance approximates the performance on held-out data.Right: If wewant to select amongmanymodels,wecannotmaximizetheperformanceonthesamedatawewillreportasascore.(Thisisverysimilarto“p-hacking”statisticalsignificance.)Insteadwemaximizeperformanceon“validation”data(blue),

and again rotate through the available data. c) All failure modes are ways in which a researcher letsinformationfromthetestset“leak”intothetrainingalgorithm.Thishappensifyouexplicitlytrainonthetestdata(top),oruseanystatisticsofthetestdatatomodifythetraindatabeforefitting(middle),orselectyourmodelsorhyperparametersbasedontheperformanceonthetestdata(bottom).

AdemonstrationandcomparisonofmachinelearningfordecodingWehavepreparedaPythonpackagethat implements themachine learningpipeline fordecodingproblems.Itisavailableathttps://github.com/KordingLab/Neural_Decoding,andincludescodetocorrectlyformattheneuralandoutputdatafordecoding,toimplementmanydifferentdecodersforboth regression and classification, and to optimize their hyperparameters. In this section, wedemonstrate its usage, and showwhich of itsmethodswork better than traditionalmethods forseveralneuraldatasets.Note that additional details, discussion, and demonstrations can be found in the SupplementalMaterials.Specificdecoders:The following decoders are available in our package, andwe demonstrate their performance onexampledatasetsbelow.Weincludedbothhistoricallineartechniques(e.g.,theWienerfilter)andmodernMLtechniques(e.g.,neuralnetworksandensemblesoftechniques).MoredetailsaboutthepreciseimplementationofthesemethodscanbefoundinSupplementalInformation.WienerFilter:TheWienerfilterusesmultiplelinearregressiontopredicttheoutputfrommultipletimebinsofallneurons’spikes.Thatis,theoutputisassumedtobealinearmappingofthenumberofspikesintherelevanttimebinsfromeveryneuron(Fig.1a,b).Wiener Cascade: The Wiener cascade (also known as a linear-nonlinear model) fits a linearregression(theWiener filter) followedbya fittedstaticnonlinearity(e.g., [35]).Thisallows foranonlinear relationship between the input and the output, and assumes that this nonlinearity ispurelyafunctionofthelinearoutput.Thedefaultnonlinearcomponentisapolynomialwithdegreethatcanbedeterminedonavalidationset.SupportVectorRegression:Insupportvectorregression(SVR)[36],theinputsareprojectedintoahigher-dimensionalspaceusinganonlinearkernel,andthenlinearlymappedfromthisspacetotheoutputtominimizeanobjectivefunction[36].Inourtoolbox,thedefaultkernelisaradialbasisfunction.XGBoost: XGBoost (Extreme Gradient Boosting) [37] is an implementation of gradient boostedtrees.Tree-basedmethodssequentiallysplittheinputspaceintomanydiscreteparts(visualizedasbranchesonatreeforeachsplit),inordertoassigneachfinal“leaf”(aportionofinputspacethatisnotsplitanymore)avalueinoutputspace[38].XGBoostfitsmanyregressiontrees,whicharetreesthat predict continuous output values. “Gradient boosting” refers to fitting each subsequentregressiontreetotheresidualsofthepreviousfit.

FeedforwardNeuralNetwork:Afeedforwardneuralnetconnectstheinputstosequentiallayersofhiddenunits,whichthenconnect to theoutput.Each layerconnects to thenext(e.g., the inputlayertothefirsthiddenlayer,orthefirsttosecondhiddenlayers)vialinearmappingsfollowedbynonlinearities.Notethat theWienercascade isaspecialcaseofaneuralnetworkwithnohiddenlayers.Simple RNN: In a standard recurrent neural network (RNN), the hidden state is a linearcombinationoftheinputsandtheprevioushiddenstate.Thishiddenstateisthenrunthroughanoutputnonlinearity,andlinearlymappedtotheoutput.RNNs,unlikefeedforwardneuralnetworks,allowtemporalchangesinthesystemtobemodeledexplicitly.GatedRecurrentUnit:Gated recurrentunits (GRUs) [39] are amore complex typeof recurrentneural network. It has gated units, which determine how much information can flow throughvariouspartsofthenetwork.Inpractice,thesegatedunits allowforbetter learningof long-termdependencies.LongShortTermMemoryNetwork:LiketheGRU,thelongshorttermmemory(LSTM)network[40]isamorecomplexrecurrentneuralnetworkwithgatedunitsthatfurtherimprovethecaptureoflong-termdependencies.TheLSTMhasmoreparametersthantheGRU.KalmanFilter:OurKalmanfilterforneuraldecodingwasbasedon[24].IntheKalmanfilter,thehiddenstateattimetisalinearfunctionofthehiddenstateattimet-1,plusamatrixcharacterizingthe uncertainty.For neural decoding, the hidden state is the kinematics (x and y components ofposition, velocity, and acceleration). Note that even though we only aim to predict position orvelocity,allkinematicsareincludedbecausethisallowsforbetterprediction.Naïve Bayes: The Naïve Bayes decoder is a type of Bayesian decoder that determines theprobabilities of different outcomes, and it then predicts the most probable. Briefly, it fits anencodingmodeltoeachneuron,makesconditionalindependenceassumptionsaboutneurons,andthenusesBayes’ruletocreateadecodingmodelfromtheencodingmodels[5].Thisprobabilisticframeworkcanincorporatepriorinformationaboutthevariables.Inthecomparisonsbelow,wealsodemonstrateanensemblemethod.WecombinedthepredictionsfromalldecodersexcepttheKalmanfilterandNaïveBayesdecoders(whichhavedifferentformats)usingafeedforwardneuralnetwork.Thatis,theeightmethods’predictionswereprovidedasinputintoafeedforwardneuralnetworkthatwetrainedtopredictthetrueoutput.DemonstrationdatasetsWe first examinedwhich of several decodingmethods performed the best across three datasetsfrommotorcortex,somatosensorycortex,andhippocampus.In the task for decoding frommotor cortex,monkeysmoved amanipulandum that controlled acursoronascreen[19],andweaimedtodecodethexandyvelocityofthecursor(Fig.S1).The21minute recording frommotor cortex contained 164 neurons. Themean andmedian firing rates,respectively,were6.7and3.4spikes/sec.Datawereputinto50msbins.Weused700msofneuralactivity(13binsbeforeandtheconcurrentbin)topredictthecurrentmovementvelocity.

Thesametaskwasusedintherecordingfromsomatosensorycortex[41].TherecordingfromS1was51minutes,andcontained52neurons.Themeanandmedianfiringrates,respectively,were9.3and6.3spikes/sec.Datawereputinto50msbins.Weused650mssurroundingthemovement(6binsbefore,theconcurrentbin,and6binsafter).In the task for decoding fromhippocampus, rats chased rewards on a platform [42, 43], andweaimedtodecodetherat’sxandyposition(Fig.S1).Fromthisrecording,weused46neuronsoveratimeperiodof75minutes.Theseneuronshadmeanandmedianfiringratesof1.7and0.2spikes/sec,respectively.Datawereputinto200msbins.Weused2secondsofsurroundingneuralactivity(4binsbefore, theconcurrentbin,and5binsafter) topredict thecurrentposition.Notethat thebins used for decoding differed in all tasks for the Kalman filter andNaïve Bayes decoders (seeSupplementalInformationforadditionaldetails).PerformancecomparisonIn order to get a qualitative impression of the performance, we first plotted the output of eachdecodingmethod foreachof the threedatasets (Fig.3). In theseexamples, themodernmethods,such as the LSTM and ensemble, appeared to outperform traditional methods. We nextquantitatively compared themethods’ performances, using themetric ofR2 onheld-out test sets(seeSupplementalInformationfordetails).Theseresultsconfirmedourqualitativefindings(Fig.4).Inparticular,neuralnetworksandtheensembleledtothebestperformance,whiletheWienerorKalmanFilter ledtotheworstperformance. Infact, theLSTMdecoderexplainedover40%oftheunexplainedvariancefromaWienerfilter(R2’sof0.88,0.86,0.62vs.0.78,0.75,0.35).Interestingly,while the Naïve Bayes decoder performed relatively well when predicting position in thehippocampus dataset (mean R2 just slightly less than the neural networks), it performed verypoorlywhenpredictinghandvelocitiesintheothertwodatasets.AnotherinterestingfindingisthatthefeedforwardneuralnetworkdidalmostaswellastheLSTMinallbrainareas.Acrosscases,theensemblemethodaddeda reliable,but small increase to theexplainedvariance.Overall,modernMLmethodsledtosignificantincreasesinpredictivepower.

Figure3:ExampleDecoderResultsExample decoding results from motor cortex (left), somatosensory cortex (middle), and hippocampus(right),forallelevenmethods(toptobottom).Groundtruthtracesareinblack,whiledecoderresultsareinvariouscolors.

Figure4:DecoderResultSummaryR2 values are reported for all decoders (different colors) for each brain area (top to bottom). Error barsrepresent the mean +/- SEM across cross-validation folds. X’s represent the R2 values of each cross-validationfold.TheNBdecoderhadmeanR2valuesof0.26and0.36(belowtheminimumy-axisvalue)forthe motor and somatosensory cortex datasets, respectively. Note the different y-axis limits for thehippocampusdatasetinthisandallsubsequentfigures.

ConcernsaboutlimiteddatafordecodingWechosearepresentativesubsetofthetenmethodstopursuefurtherquestionsaboutparticularaspectsofneuraldataanalysis: thefeedforwardneuralnetworkandLSTM(twoeffectivemodernmethods),alongwiththeWienerandKalmanfilters(twotraditionalmethodsinwidespreaduse).The improved predictive performance of the modern methods is likely due to their greatercomplexity. However, this greater complexity may make these methods unsuitable for smalleramountsofdata.Thus,wetestedperformancewithvaryingamountsoftrainingdata.Withonly2minutesofdataformotorandsomatosensorycortices,and15minutesofhippocampusdata,bothmodernmethodsoutperformedboth traditionalmethods (Fig.5a, Fig. S2). Whendecreasing theamount of trainingdata further, to only 1minute formotor and somatosensory cortices and7.5minutes forhippocampusdata, theKalman filterperformancewas sometimes comparable to themodernmethods,butthemodernmethodssignificantlyoutperformedtheWienerFilter(Fig.5a).Thus,evenforlimitedrecordingtimes,modernMLmethodscanyieldsignificantgainsindecodingperformance.Besides limited recording times, neural data is often limited in thenumberof recordedneurons.Thus,wecomparedmethodsusingasubsetofonly10neurons.Formotorandsomatosensorydata,despite a general decrease in performance for all decoding methods, the modern methodssignificantly outperformed the traditional methods (Fig. 5b). For the hippocampus dataset, nomethodpredictedwell (meanR2<0.25)withonly10neurons.This is likelybecause10sparselyfiring neurons (median firing of HC neurons was ~0.2 spikes / sec) did not contain enoughinformationabouttheentirespaceofpositions.However,inmostscenarios,withlimitedneuronsandforlimitedrecordedtimes,modernMLmethodscanbeadvantageous.

Figure5:Decoderresultswithlimiteddata(A) Testing the effects of limited training data. Using varying amounts of training data, we trained twotraditional methods (Wiener filter and Kalman filter), and two modern methods (feedforward neuralnetworkandLSTM).R2valuesarereportedforthesedecoders(differentcolors)foreachbrainarea(toptobottom). Error bars are 68% confidence intervals (meant to approximate the SEM) produced viabootstrapping,asweusedasingletestset.ValueswithnegativeR2swerenotshown.(B)Testingtheeffects

of few neurons. Using only 10 neurons, we trained two traditional methods (Wiener filter and Kalmanfilter),andtwomodernmethods(feedforwardneuralnetworkandLSTM).WeusedthesametestingsetasinpanelA,andthelargesttrainingsetfrompanelA.R2valuesarereportedforthesedecodersforeachbrainarea.Errorbarsrepresentthemean+/-SEMofmultiplerepetitionswithdifferentsubsetsof10neurons.X’srepresenttheR2valuesofeachrepetition.Notethatthey-axislimitsaredifferentinpanelsAandB.

Concernsaboutrun-timeWhile modern ML methods lead to improved performance, it is important to know that thesesophisticateddecodingmodelscanbetrainedinareasonableamountoftime.Togetafeelingforthetypicaltimescale,considerthatwhenrunningourdemonstrationdataonadesktopcomputerusing only CPUs, it took less than 1 second to fit a Wiener filter, less than 10 seconds to fit afeedforwardneural,andlessthan8minutestofitanLSTM.(Thiswasfor30minutesofdata,using10timebinsofneuralactivityforeachprediction,and50neurons.)Inpractice,thesemodelswillneed to be fit tens to hundreds of times when incorporating hyperparameter optimization andcross-validation.As a concrete example, let us say thatwe are doing5-fold cross-validation, andhyperparameteroptimization requires50 iterationsper cross-validation fold. In that case, fittingtheLSTMwouldtakeabout30hours,andfittingthefeedforwardneuralnetwouldtakelessthan1hour. Note that variations in hardware and software implementations can drastically changeruntime,andmodernGPUscanoften increasespeedsapproximately10-fold.Thus,whilemodernmethodsdo takesignificantly longer to train (andmayrequirerunningovernightwithoutGPUs),thatshouldbemanageableformostofflineapplications.ConcernsaboutrobustnesstohyperparametersAll our previous results used hyperparameter optimization. While we strongly encourage athorough hyperparameter optimization, a user with limited time might just do a limitedhyperparameter search. Thus, it is helpful to know how sensitive results may be to varyinghyperparameters.Wetestedtheperformanceofthefeedforwardneuralnetworkwhilevaryingtwohyperparameters: thenumberofunitsandthedropoutrate(aregularizationhyperparameterforneuralnetworks).Weheldthethirdhyperparameterinourcodepackage,thenumberoftrainingepochs,constantat10.Wefoundthattheperformanceoftheneuralnetworkwasgenerallyrobustto large changes in the hyperparameters (Fig. 6). As an example, for the somatosensory cortexdataset,thepeakperformanceoftheneuralnetworkwasR2=0.86with1000unitsand0dropout,andvirtuallythesame(R2=0.84)with300unitsand30%dropout.Evenwhenusinglimiteddata,neuralnetworkperformancewasrobust tohyperparameterchanges.For instance,whentrainingthe somatosensory cortex dataset with 1 minute of training data, the peak performance wasR2=0.77with700unitsand20%dropout.Anetworkwith300unitsand30%dropouthadR2=0.75.Notethatthehippocampusdataset,inparticularwhenusinglimitedtrainingdata,didhavegreatervariability, emphasizing the importance of hyperparameter optimization on sparse datasets.However, for most datasets, researchers should not be concerned that slightly non-optimalhyperparameterswillleadtolargelydegradedperformance.

Figure6:SensitivityofneuralnetworkresultstohyperparameterselectionInafeedforwardneuralnetwork,wevariedthenumberofhiddenunitsperlayer(inincrementsof100)andthe proportion of dropout (in increments of 0.1), and evaluated the decoder’s performance on all threedatasets(toptobottom).Theneuralnetworkhadtwohiddenlayers,eachwiththesamenumberofhiddenunits.Thenumberoftrainingepochswaskeptconstantat10.ThecolorsshowtheR2onthetestset,andeachpanel’scolorswereputintherange:[maximumR2–0.2,maximumR2].a)Weusedalargeamountoftrainingdata (themaximumamountused inFig.5a),whichwas10,20, and37.5minutesofdata for themotorcortex,somatosensorycortex,andhippocampusdatasets,respectively.b)Sameresultsforalimitedamount of training data: 1, 1, and 15 minutes of data for the motor cortex, somatosensory cortex, andhippocampusdatasets,respectively.

DiscussionofthisdemonstrationOurcomparisons,whichweremadeusingthecodewehavepublishedonline,showthatmachinelearning works well on typical neural decoding datasets, outperforming traditional decodingmethods. In our demonstration, we decoded continuous-valued variables. However, these samemethods can be used for classification tasks, which often use classic decoders such as logisticregressionandsupportvectormachines.Ouravailablecodealsoincludesclassificationmethods.Wehavedecodedfromspikingdata,butitispossiblethattheproblemofdecodingfromotherdatamodalities is different. One main driver of a difference may be the distinct levels of noise. Forexample, fMRI signals have far higher noise levels than spikes. As the noise level goes up, lineartechniques become more appropriate, which may ultimately lead to a situation where thetraditional linear techniques become superior. Applying the same analyses we did here acrossdifferentdatamodalitiesisanimportantnextstep.Allourdecodingwasdone“offline,”meaning that thedecodingoccurredafter therecording,andwasnotpartofacontrolloop.Thistypeofdecodingisusefulfordetermininghowinformationinaparticularbrainarearelatestoanexternalvariable.However,forengineeringapplicationssuchasBMIs[44,45],thegoalistodecodeinformation(e.g.,predictmovements)inrealtime.Ourresultsheremaynotapplyasdirectlytoonlinedecodingsituations,sincethesubjectisultimatelyabletoadapt to imperfections in the decoder. In that case, even relatively large decoder performancedifferencesmaybe irrelevant.Plus, thereareadditionalchallenges inonlineapplications,suchasnon-stationary inputs (e.g., due to electrodes shifting in the brain) [25, 46, 47]. Finally, onlineapplications are concerned with computational runtime, which we have only briefly addressedhere. In the future, it would be valuable to test modern techniques for decoding in onlineapplications(asin[46,48]).ConclusionMachinelearningisrelativelystraightforwardtoapplyandcangreatlyimprovetheperformanceofneuraldecoding.Itsprincipleadvantageisthatmanyfewerassumptionsneedtobemadeaboutthestructureof theneural activity and thedecodedvariables.Bestpractices like testingonheld-outdata,perhapsviacrossvalidation,arecrucialtotheMLpipelineandarecriticalforanyapplication.ThePythonpackagethataccompaniesthistutorialisdesignedtoguidebothbestpracticesandthedeployment of specific ML algorithms, and we expect it will be useful in improving decodingperformancefornewdatasets.Ourhunchisthatitwillbehardforspecializedalgorithms(e.g.[49,50])tocompetewiththestandardalgorithmsdevelopedbythemachinelearningcommunity.AcknowledgementsWewould like to thank Pavan Ramkumar for helpwith code development. For funding, JGwassupportedbyNIHF31EY025532andNIHT32HD057845.ABwassupportedbyNIHMH103910.

MPwassupportedbyNIHF31NS092356andNIHT32HD07418.RCwassupportedbyNIHR01NS095251andDGE-1324585.LMwassupportedbyNIHR01NS074044andNIHR01NS095251.KKwassupportedbyNIHR01NS074044,NIHR01NS063399andNIHR01EY021579.

References

1. SerruyaMD,HatsopoulosNG,PaninskiL,FellowsMR,DonoghueJP.Brain-machineinterface:Instantneuralcontrolofamovementsignal.Nature.2002;416(6877):141-2.2. Ethier C, Oby ER, Bauman MJ, Miller LE. Restoration of grasp following paralysis through brain-controlled stimulation of muscles. Nature. 2012;485(7398):368-71. doi: 10.1038/nature10987. PubMedPMID:22522928;PubMedCentralPMCID:PMCPMC3358575.3. BaegE,KimY,HuhK,Mook-JungI,KimH,JungM.Dynamicsofpopulationcodeforworkingmemoryintheprefrontalcortex.Neuron.2003;40(1):177-88.4. IbosG,FreedmanDJ.Sequentialsensoryanddecisionprocessing inposteriorparietalcortex.eLife.2017;6.5. Zhang K, Ginzburg I, McNaughton BL, Sejnowski TJ. Interpreting neuronal population activity byreconstruction: unified framework with application to hippocampal place cells. J Neurophysiol.1998;79(2):1017-44.6. Davidson TJ, Kloosterman F, Wilson MA. Hippocampal replay of extended experience. Neuron.2009;63(4):497-507.7. HeK,ZhangX,RenS,Sun J.DelvingDeep intoRectifiers:SurpassingHuman-LevelPerformanceonImageNetClassification.2015IEEEInternationalConferenceonComputerVision(ICCV);20152015.8. SilverD,HuangA,MaddisonCJ,GuezA,SifreL,vandenDriesscheG,etal.MasteringthegameofGowithdeepneuralnetworksandtreesearch.Nature.2016;529(7587):484-9.doi:10.1038/nature16961.9. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-44. doi:10.1038/nature14539.PubMedPMID:26017442.10. GlaserJI,BenjaminAS,FarhoodiR,KordingKP.Therolesofsupervisedmachinelearninginsystemsneuroscience.ProgNeurobiol.2019.11. CollingerJL,WodlingerB,DowneyJE,WangW,Tyler-KabaraEC,WeberDJ,etal.High-performanceneuroprostheticcontrolbyanindividualwithtetraplegia.TheLancet.2013;381(9866):557-64.12. Raposo D, Kaufman MT, Churchland AK. A category-free neural population supports evolvingdemandsduringdecision-making.NatNeurosci.2014;17(12):1784-92.13. Rich EL, Wallis JD. Decoding subjective decisions from orbitofrontal cortex. Nat Neurosci.2016;19(7):973-80.14. Hung CP, Kreiman G, Poggio T, DiCarlo JJ. Fast readout of object identity from macaque inferiortemporalcortex.Science.2005;310(5749):863-6.15. QuirogaRQ,SnyderLH,BatistaAP,CuiH,AndersenRA.Movementintentionisbetterpredictedthanattentionintheposteriorparietalcortex.JNeurosci.2006;26(13):3615-20.16. HernándezA,NácherV,LunaR,ZainosA,LemusL,AlvarezM,etal.Decodingaperceptualdecisionprocessacrosscortex.Neuron.2010;66(2):300-14.17. van derMeerMA, JohnsonA, Schmitzer-TorbertNC, RedishAD. Triple dissociation of informationprocessingindorsalstriatum,ventralstriatum,andhippocampusonalearnedspatialdecisiontask.Neuron.2010;67(1):25-32.18. DeklevaBM,RamkumarP,WandaPA,KordingKP,MillerLE.Uncertaintyleadstopersistenteffectsonreachrepresentationsindorsalpremotorcortex.eLife.2016;5:e14316.doi:10.7554/eLife.14316.19. GlaserJI,PerichMG,RamkumarP,MillerLE,KordingKP.Populationcodingofconditionalprobabilitydistributions in dorsal premotor cortex. Nature Communications. 2018;9(1):1788. doi: 10.1038/s41467-018-04062-6.20. WeygandtM,BleckerCR,SchäferA,HackmackK,HaynesJ-D,VaitlD,etal.fMRIpatternrecognitioninobsessive–compulsivedisorder.Neuroimage.2012;60(2):1186-93.21. Naufel S, Glaser JI, Kording KP, Perreault EJ,Miller LE. Amuscle-activity-dependent gain betweenmotorcortexandEMG.JNeurophysiol.2018;121(1):61-73.22. PaganM, Simoncelli EP,RustNC.Neural quadratic discriminant analysis:NonlineardecodingwithV1-likecomputation.NeuralComput.2016;28(11):2291-319.23. Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE transactions onevolutionarycomputation.1997;1(1):67-82.

24. WuW,BlackMJ,GaoY,SerruyaM,ShaikhouniA,DonoghueJ,etal.,editors.NeuraldecodingofcursormotionusingaKalmanfilter.AdvNeuralInfProcessSyst;2003.25. WuW,HatsopoulosNG.Real-timedecodingof nonstationaryneural activity inmotor cortex. IEEETransNeuralSystRehabilEng.2008;16(3):213-22.26. Gilja V, Nuyujukian P, Chestek CA, Cunningham JP, Byron MY, Fan JM, et al. A high-performanceneuralprosthesisenabledbycontrolalgorithmdesign.NatNeurosci.2012;15(12):1752.27. Barbieri R, Wilson MA, Frank LM, Brown EN. An analysis of hippocampal spatio-temporalrepresentationsusingaBayesianalgorithmforneuralspiketraindecoding.IEEETransNeuralSystRehabilEng.2005;13(2):131-6.28. KloostermanF, LaytonSP,ChenZ,WilsonMA.Bayesiandecodingusingunsorted spikes in the rathippocampus.JNeurophysiol.2013;111(1):217-27.29. Friedman J,HastieT,TibshiraniR.Theelementsof statistical learning:Springer series in statisticsNewYork;2001.30. LiawA,WienerM.ClassificationandregressionbyrandomForest.Rnews.2002;2(3):18-22.31. Kaggle.Availablefrom:https://www.kaggle.com.32. BergstraJ,BengioY.Randomsearchforhyper-parameteroptimization.JournalofMachineLearningResearch.2012;13(Feb):281-305.33. Snoek J, Larochelle H, Adams RP, editors. Practical bayesian optimization of machine learningalgorithms.AdvNeuralInfProcessSyst;2012.34. Hyperopt.http://hyperopt.github.io/hyperopt/.35. PohlmeyerEA,SollaSA,PerreaultEJ,MillerLE.Predictionofupperlimbmuscleactivityfrommotorcorticaldischargeduringreaching.Journalofneuralengineering.2007;4(4):369.36. ChangC-C,LinC-J.LIBSVM:a library forsupportvectormachines.ACMTransactionsonIntelligentSystemsandTechnology(TIST).2011;2(3):27.37. ChenT,GuestrinC,editors.Xgboost:Ascalabletreeboostingsystem.Proceedingsofthe22NdACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining;2016:ACM.38. BreimanL.Classificationandregressiontrees:Routledge;2017.39. ChoK,VanMerriënboerB,GulcehreC,BahdanauD,BougaresF,SchwenkH,etal.Learningphraserepresentations using RNN encoder-decoder for statistical machine translation. arXiv preprintarXiv:14061078.2014.40. HochreiterS,SchmidhuberJ.Longshort-termmemory.NeuralComput.1997;9(8):1735-80.41. Benjamin AS, Fernandes H, Tomlinson T, Ramkumar P, VerSteeg C, Chowdhury R, et al. Modernmachinelearningasabenchmarkforfittingneuralresponses.FrontComputNeurosci.2018;12:56.42. MizusekiK,SirotaA,PastalkovaE,BuzsákiG.Thetaoscillationsprovidetemporalwindowsforlocalcircuitcomputationintheentorhinal-hippocampalloop.Neuron.2009;64(2):267-80.43. MizusekiK,SirotaA,PastalkovaE,BuzsákiG.Multi-unitrecordingsfromtherathippocampusmadeduringopenfieldforaging.2009.doi:http://dx.doi.org/10.6080/K0Z60KZ9.44. Kao JC, Stavisky SD, Sussillo D, Nuyujukian P, Shenoy KV. Information systems opportunities inbrain–machineinterfacedecoders.ProceedingsoftheIEEE.2014;102(5):666-82.45. Nicolas-AlonsoLF,Gomez-GilJ.Braincomputerinterfaces,areview.Sensors.2012;12(2):1211-79.46. SussilloD,StaviskySD,KaoJC,RyuSI,ShenoyKV.Makingbrain–machineinterfacesrobusttofutureneuralvariability.Naturecommunications.2016;7:13749.47. FarshchianA,GallegoJA,CohenJP,BengioY,MillerLE,SollaSA.AdversarialDomainAdaptationforStableBrain-MachineInterfaces.arXivpreprintarXiv:181000045.2018.48. SussilloD,NuyujukianP, Fan JM,Kao JC, Stavisky SD,Ryu S, et al. A recurrent neural network forclosed-loop intracortical brain–machine interface decoders. Journal of neural engineering.2012;9(2):026027.49. CorbettE,PerreaultE,KoerdingK,editors.Mixtureoftime-warpedtrajectorymodelsformovementdecoding.AdvNeuralInfProcessSyst;2010.50. Kao JC, Nuyujukian P, Ryu SI, Shenoy KV. A high-performance neural prosthesis incorporatingdiscretestateselectionwithhiddenMarkovmodels.IEEETransBiomedEng.2017;64(4):935-45.51. London BM, Miller LE. Responses of somatosensory area 2 neurons to actively and passivelygeneratedlimbmovements.JNeurophysiol.2013;109(6):1505-13.

52. FaggAH,OjakangasGW,MillerLE,HatsopoulosNG.Kinetictrajectorydecodingusingmotorcorticalensembles.IEEETransNeuralSystRehabilEng.2009;17(5):487-96.53. NadeauC,BengioY,editors.Inferenceforthegeneralizationerror.AdvNeuralInfProcessSyst;2000.54. CholletF.Keras.2015.55. Glorot X, Bordes A, Bengio Y, editors. Deep sparse rectifier neural networks. Proceedings of theFourteenthInternationalConferenceonArtificialIntelligenceandStatistics;2011.56. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way topreventneuralnetworksfromoverfitting.JournalofMachineLearningResearch.2014;15(1):1929-58.57. KingmaD,BaJ.Adam:Amethodforstochasticoptimization.arXivpreprintarXiv:14126980.2014.58. TielemanT,HintonG.Lecture6.5-RmsProp:Divide thegradientbyarunningaverageof its recentmagnitude.COURSERA:NeuralNetworksforMachineLearning.2012.59. ParkIM,ArcherEW,PriebeN,PillowJW,editors.Spectralmethodsforneuralcharacterizationusinggeneralizedquadraticmodels.AdvNeuralInfProcessSyst;2013.60. KrizhevskyA,SutskeverI,HintonGE,editors.Imagenetclassificationwithdeepconvolutionalneuralnetworks.AdvNeuralInfProcessSyst;2012.61. Zhang C, Bengio S, HardtM, Recht B, Vinyals O. Understanding deep learning requires rethinkinggeneralization.arXivpreprintarXiv:161103530.2016.62. Gao P, Trautmann E, Byron MY, Santhanam G, Ryu S, Shenoy K, et al. A theory of multineuronaldimensionality,dynamicsandmeasurement.bioRxiv.2017:214262.63. Shenoy KV, Sahani M, Churchland MM. Cortical control of arm movements: a dynamical systemsperspective. Annu Rev Neurosci. 2013;36:337-59. doi: 10.1146/annurev-neuro-062111-150509. PubMedPMID:23725001.

SupplementalInformationDemonstrationMethodsAmore completedescriptionof themethodsused inourdemonstrations follows.Whilemuchofthisdescriptionoverlapswiththebriefdescriptionprovidedinthemaintext,wehaveincludedthecompletedescriptionheresothatreadersinterestedinthemethodsdetailscanusethissupplementasaprimaryresource,ratherthanhavingtoswitchbetweendocuments.DemonstrationdatasetsDecodingmovement velocity from themotor and somatosensory cortices: In our “random-target”experiment [19],monkeysmovedaplanarmanipulandumthatcontrolledacursoron thescreen(Fig.S1a).Themonkeyscontinuouslyreachedtonewlypresentedtargets,withaverybriefhold period (about 200 ms) between reaches. After training, the monkeys were surgicallyimplantedwith96-channelUtahelectrodearrays (BlackrockMicrosystems,SaltLakeCity,UT) torecordtheextracellularactivityofcorticalneurons.Inoneexperiment[19],werecordedfrombothprimarymotor cortex (M1)anddorsalpremotor cortex (PMd)and combinedneurons frombothareas.Therecordingfrommotorcortexwas21minutesandcontained164neurons.Themeanandmedianfiringrates,respectively,were6.7and3.4spikes/sec.Inanotherexperimentwerecordedfromarea2ofprimarysomatosensorycortex(S1)[41].TherecordingfromS1was51minutes,andcontained52neurons.Themeanandmedian firing rates, respectively,were9.3and6.3 spikes/sec.Frombothbrainregions,weaimedtopredictthexandycomponentsofmovementvelocity.Decoding position from the hippocampus: We used a dataset from CRCNS (CollaborativeResearchinComputationalNeuroscience),inwhichratschasedrewardsonasquareplatform(Fig.S1b)andextracellularrecordingsweremadefromlayerCA1ofdorsalhippocampus(HC)[42,43].More specifically, we used dataset “hc2” and session “ec014.333” from the repository. TherecordingfromHCwas93minutes,andcontained58neurons.Wedidnotusethefinal20%oftherecording, in which the rat had limited movement. We excluded neurons with fewer than 100spikes over the duration of the experiment, resulting in 46 neurons. The resulting neurons hadmeanandmedianfiringrates,respectively,of1.7and0.2spikes/sec.Weaimedtopredictthexandypositionoftherat.

SupplementalFigure1:TasksandDecodingSchematica)Inthetaskfordecodingfrommotorandsomatosensorycortices,monkeysmovedaplanarmanipulandumthat controlled a cursor on the screen. The monkeys continuously reached to new targets as they werepresented, with the exception of a brief hold period between reaches. b) In the task for decoding fromhippocampus,ratschasedrewardsonasquareplatform.Generaldecodingmethodsforourdemonstration:Decodingmovementvelocity fromthemotorandsomatosensorycortices:Wepredicted theaveragevelocity(xandycomponents) in50msbins.Neuralspiketrainsusedfordecodingwerealsoputinto50msbins.Inmotorcortex,weused700msofneuralactivity(13binsbeforeandtheconcurrent bin) to predict the current movement velocity, as a primary interest in the field isinvestigating how motor cortex affects movement. In somatosensory cortex, we used 650 mssurroundingthemovement(6binsbefore,theconcurrentbin,and6binsafter),asneuralactivityhasbeenshownbothprecedingandfollowingmovement[51].NotethatthebinsusedfordecodingdifferedfromthosefortheKalmanfilterandNaïveBayesdecoders(seeSpecificDecodersbelow).Also, when determining how performance varied as a function of bin size (Fig. S3), we used aslightlydifferentamountofneuraldata,inordertohaveaquantitythatwasdivisiblebymanybinsizes. In thiscase, formotorcortex,weused600msofneuralactivityprior toand including thecurrentbin.Forsomatosensorycortex,weused600msofneuralactivitycenteredonthecurrentbin.Decoding position from the hippocampus: We aimed to predict the position (x and ycoordinates)oftheratin200msbins.Neuralspiketrainsusedfordecodingwerealsoputinto200msbins.Weused2secondsofsurroundingneuralactivity(4binsbefore,theconcurrentbin,and5binsafter) topredict thecurrentposition.Note that thebinsdiffered fromabove for theKalmanfilter(seeSpecificDecodersbelow).Whendetermininghowperformancevariedasafunctionofbinsize(Fig.S3),weused2secondsofneuralactivity,centeredonthecurrentbin.ScoringMetric:ScoringMetric:Todeterminethegoodnessoffit,weused

R2 =1−yi − yi( )2

i∑

y − yi( )2i∑

,where !! yi arethepredictedvalues, !yi arethetruevaluesand !y isthemean

value. This formulation of R2 (which is the fraction of variance accounted for, rather than thesquaredPearson’scorrelationcoefficient[52])canbenegativeonthetestsetduetooverfittingon

thetrainingset.ThereportedR2valuesaretheaverageacrossthexandycomponentsofvelocityorposition.Preprocessing: The training input was normalized (z-scored). The training output was zero-centered(meansubtracted),except insupportvectorregression,where theoutputwasz-scored,whichhelpedalgorithmperformance.Thevalidation/testinginputsandoutputswerepreprocessedusingthepreprocessingparametersfromthetrainingset.Cross-validation:When determining the R2 for every method (Fig. 3), we used 10 fold cross-validation.Foreachfold,wesplitthedataintoatrainingset(80%ofdata),acontiguousvalidationset(10%ofdata),andacontiguoustestingset(10%ofdata).Foreachfold,decodersweretrainedtominimize themean squared error between the predicted and true velocities/positions of thetrainingdata.WefoundthealgorithmhyperparametersthatledtothehighestR2onthevalidationsetusingBayesianoptimization[33].Thatis,wefitmanymodelsonthetrainingsetwithdifferenthyperparametersandcalculatedtheR2onthevalidationset.Then,usingthehyperparametersthatledtothehighestvalidationsetR2,wecalculatedtheR2valueonthetestingset.Errorbarsonthetest set R2 values were computed across cross-validation folds. Because the training sets ondifferent foldswereoverlapping,computing theSEMas𝜎/ 𝑛 (where𝜎 is thestandarddeviationandn isthenumberoffolds)wouldhaveunderestimatedthesizeoftheerrorbars[53].Wethus

calculatedtheSEMas𝜎 ∗ !!+ !

!!!,whichtakesintoaccountthattheestimatesacrossfoldsarenot

independent[53].Bootstrapping:Whendetermininghowperformancescaledasafunctionofdatasize(Figs.5andS3),weusedsingle testandvalidationsets,andvaried theamountsof trainingdata thatdirectlyprecededthevalidationset.Wedidnotdothison10cross-validationfoldsduetolongrun-times.The test and validation setswere5minutes long formotor and somatosensory cortices, and7.5minutes forhippocampus.Togeterrorbars,weresampled fromthe testset.Becauseof thehighcorrelationbetweentemporallyadjacentsamples,wedidn’tresamplerandomlyfromallexamples(which would create highly correlated resamples). Instead, we separated the test set into 20temporallydistinctsubsets,S1-S20(i.e.,S1isfromt=1tot=T/20,S2isfromt=T/20tot=2T/20,etc.,whereTistheendtime),toensurethatthesubsetsweremorenearlyindependentofeachother.We then resampled combinations of these 20 subsets (e.g., S5, S13, … S2) 1000 times to getconfidenceintervalsofR2values.Implementationdetailsforspecificdecoders:WienerFilter:TheWienerfilterusesmultiplelinearregressiontopredicttheoutputfrommultipletime bins of every neurons’ spikes. That is, the output is assumed to be a linearmapping of thenumberofspikesintherelevanttimebinsfromeveryneuron(Fig.1a,b).Weusedseparatemodelstopredictthexandycomponentsofthekinematics.Wiener Cascade: The Wiener cascade (also known as a linear-nonlinear model) fits a linearregression(theWiener filter) followedbya fittedstaticnonlinearity(e.g., [35]).Thisallows foranonlinear relationship between the input and the output, and assumes that this nonlinearity ispurelya functionof the linearoutput.Here,as in theWienerFilter, the inputwasneurons’spike

ratesoverrelevanttimebins.Thenonlinearcomponentwasapolynomialwithdegreedeterminedon the validation set. Separate models were used to predict the x and y components of thekinematics.Support Vector Regression: In support vector machine regression (SVR) [36], the inputs areprojectedintoahigher-dimensionalspaceusinganonlinearkernel,andthenlinearlymappedfromthis space to the output tominimize an objective function [36].Here,weused standard supportvector regression (SVR) with a radial basis function kernel to predict the kinematics from theneurons’spikeratesineachbin.Wesethyperparametersforthepenaltyoftheerrortermandthemaximumnumberof iterations.Separatemodelswereusedtopredictthexandycomponentsofthekinematics.XGBoost: XGBoost (Extreme Gradient Boosting) [37] is an implementation of gradient boostedtrees.Tree-basedmethodssequentiallysplittheinputspaceintomanydiscreteparts(visualizedasbranchesonatreeforeachsplit),inordertoassigneachfinal“leaf”(aportionofinputspacethatisnotsplitanymore)avalueinoutputspace[38].Wefitmanyregressiontrees,whicharetreesthatpredictcontinuousoutputvalues.“Gradientboosting”referstofittingeachsubsequentregressiontreetotheresidualsofthepreviousfit.Here,weusedXGBoosttopredictthekinematicsfromtheneurons’ spike rates in each bin. We set hyperparameters for the maximum depth of the tree,numberoftrees,andlearningrate.Wefitseparatemodelstopredictthexandycomponents.FeedforwardNeuralNetwork:Afeedforwardneuralnetconnectstheinputstosequentiallayersofhiddenunits,whichthenconnect to theoutput.Each layerconnects to thenext(e.g., the inputlayertothefirsthiddenlayer,orthefirsttosecondhiddenlayers)vialinearmappingsfollowedbynonlinearities.Notethat theWienercascade isaspecialcaseofaneuralnetworkwithnohiddenlayers. Using the Keras library [54], we created a fully connected (dense) feedforward neuralnetworkwith2hiddenlayersandrectifiedlinearunit[55]activationsaftereachhiddenlayer.Werequiredthenumberofhiddenunitsineachlayertobethesame.Wesethyperparametersforthenumberofhiddenunitsinthelayers,amountofdropout[56],andnumberoftrainingepochs.Weused the Adam algorithm [57] as the optimization routine. This neural network, and all neuralnetworksbelowhadtwooutputunits.Thatis,thesamenetworkpredictedthexandycomponentstogether, rather thanseparately.The inputwasstill thenumberof spikes ineachbin fromeveryneuron.Notethatwerefertofeedforwardneuralnetworksasa“modern”technique,despitetheirhavingbeenusedformanydecades,duetotheircurrentresurgenceandthemodernmethodsfortraining.Simple RNN: In a standard recurrent neural network (RNN), the hidden state is a linearcombinationoftheinputsandtheprevioushiddenstate.Thishiddenstateisthenrunthroughanoutputnonlinearity,andlinearlymappedtotheoutput.RNNs,unlikefeedforwardneuralnetworks,allowtemporalchangesinthesystemtobemodeledexplicitly.Here,usingtheKeraslibrary[54],we created aneuralnetwork architecture inwhich the spiking inputs fromall neuronswere fedinto a standard recurrent neural network (Fig. 1c). The units from this recurrent layerwere fedthroughrectifiedlinearunitnonlinearities,andfullyconnectedtoanoutputlayerwithtwounits(xandyvelocityorpositioncomponents).Wesethyperparametersforthenumberofunits,amountofdropout,andnumberoftrainingepochs.WeusedRMSprop[58]astheoptimizationroutine.

GatedRecurrentUnit:Gated recurrentunits (GRUs) [39] are amore complex typeof recurrentneural network. It has gated units, which in practice allow for better learning of long-termdependencies. Almost all implementationmethodswere the same as for the simple RNN, exceptGatedRecurrentUnitswereusedinstead.Animplementationdifferenceisthattheunitsfromtherecurrentlayerswerefedthroughhyperbolictangent(tanh)activations(asisstandardforGRUs)ratherthanrectifiedlinearunitactivations.LongShortTermMemoryNetwork:LiketheGRU,thelongshorttermmemory(LSTM)network[40]isamorecomplexrecurrentneuralnetworkwithgatedunitsthatfurtherimprovethecaptureof long-term dependencies. The LSTM has more parameters than the GRU. All implementationmethodswerethesameasforGRUs,exceptLSTMunitswereusedinstead.Ensemble:Ensembletechniquescombinethepredictionsfromseveralmethods,andthushavethepotentialtoleveragetheirdifferentbenefits.WeusedthepredictionsfromalldecodersexcepttheKalman filter and Naïve Bayes decoders (which have different formats). We combined thepredictionsfromtheabove8methodsusingafeedforwardneuralnetwork.Thatis,the8methods’predictionswereprovidedas input intoa feedforwardneuralnetworkthatwetrainedtopredictthetrueoutput.Thiswasdoneseparatelyforthexandycomponentsofthepositionorvelocity.KalmanFilter:OurKalmanfilterforneuraldecodingwasbasedon[24].Asourimplementationisnotidentical,wewriteoutthefulldescriptionhere.IntheKalmanfilter,thehiddenstateattimetisa linear functionof thehiddenstateat time t-1,plusamatrix characterizing theuncertainty.Forneuraldecoding,thehiddenstateisthekinematics(xandycomponentsofposition,velocity,andacceleration).Notethateventhoughweonlyaimtopredictpositionorvelocity,allkinematicsareincludedbecausethisallowsforbetterprediction.Moreformally,

𝒚! = 𝑨𝒚!!! +𝒘where 𝒚! and 𝒚!!! are 6 x 1 vectors, A is a 6 x 6 matrix, andw is sampled from a normaldistribution,N(0,W),withmean0andcovarianceW.Wisthe6x6uncertaintymatrix.Theobservation (measurement) at time t* is a linear functionof thehidden state at time t (plusnoise). For neural decoding, themeasurement is the neural activity. Note thatwe allowed a lagbetweentheneuraldataandpredictedkinematics,whichiswhyweuset*forthetimeoftheneuralactivity.Moreformally,

𝒙!∗ = 𝑯𝒚! + 𝒒where 𝒙!∗ isanN x1vector,H isanN x6matrix,andq is sampled fromanormaldistribution,N(0,Q),withmean0andcovarianceQ.QistheNxNmeasurementnoisematrix.During training,A,H,W, andQ are empirically fit on the training set usingmaximum likelihoodestimation.Whenmakingpredictions, toupdate theestimatedhiddenstateatagiventimepoint,theupdatesderivedfromthecurrentmeasurementandtheprevioushiddenstatesarecombined.

Duringthiscombination,thenoisematricesgiveahigherweighttothelessuncertaininformation.See[24]orourcodefortheupdateequations(notethatxandyhavedifferentnotationin[24]).Wehadonehyperparameterwhichdifferedfromthestandardimplementation[24].Wedividedthenoisematrixassociatedwiththetransitioninkinematicstates,W,bythehyperparameterscalarC,whichallowedweightingtheneuralevidenceandkinematictransitionsdifferently. Therationalefor this addition is that neurons have temporal correlations, which make it desirable to have aparameter that allows changing theweight of the new neural evidence. The introduction of thisparametermadeabigdifferenceforthehippocampusdataset(Fig.S4).Wealsoallowedfora lagbetweentheneuraldataandpredictedkinematics.Thelagandhyperparameterweredeterminedbasedonvalidationsetperformance.Naïve Bayes: The Naïve Bayes decoder is a type of Bayesian decoder that determines theprobabilities of different outcomes, and it then predicts the most probable. Briefly, it fits anencodingmodeltoeachneuron,makesconditionalindependenceassumptionsaboutneurons,andthen uses Bayes’ rule to create a decoding model from the encoding models. This probabilisticframeworkcanincorporatepriorinformationaboutthevariables.WeusedaNaïveBayesdecodersimilar to theone implemented in [5].Aswith theKalman filter,becausethisisnotanout-of-the-boxmethodthatweapplied,wewritethefulldetailshere.Wefirstfitanencodingmodel(tuningcurve)usingtheoutputvariables.Letfi(s)bethevalueofthetuningcurve(theexpectednumberofspikes)forneuroniattheoutputvariabless.Notethatsisavectorcontaining the twooutputvariableswearepredicting(xandypositions/velocities). Weassumethenumberofrecordedspikesinthegivenbin,ri,isgeneratedfromthetuningcurvewithPoissonstatistics:

𝑃 𝑟! 𝒔 =exp [−𝑓!(𝒔)]𝑓!(𝒔)!!

𝑟!!

Wealsoassumethatalltheneurons’spikecountsareconditionallyindependentgiventheoutputvariables,sothat:

𝑃 𝒓 𝑠 ∝ 𝑃 𝑟! 𝑠!

whererisavectorwiththespikecountsofallneurons.Bayes’rulecanthenbeusedtodeterminethelikelihoodoftheoutputvariablesgiventhespikecountsofallneurons:

𝑃 𝒔 𝒓 ∝ 𝑃 𝒓 𝒔 𝑃(𝒔)whereP(s)istheprobabilitydistributionoftheoutputvariables.Tohelpwithtemporalcontinuityofdecoding,wewantourprobabilisticmodeltoincludehowtheoutputvariablesatonetimestep

dependontheoutputvariablesattheprevioustimestep:𝑃 𝒔! 𝒔!!! .Thus,wecanmoregenerallywrite,usingBayes’ruleasbefore:

𝑃 𝒔! 𝒓!∗, 𝒔!!! ∝ 𝑃 𝒓!∗ 𝒔! 𝑃 𝒔!!! 𝒔! 𝑃 𝒔! Note thatweuse𝒓!∗rather than𝒓!becauseweuseneural responses frommultiple timebins topredictthecurrentoutputvariables.Theaboveformulaassumesthat𝒓!∗and𝒔!!!areindependent,conditionedon𝒔! .Thefinaldecodedstimulusinatimebinis:argmax𝒔! 𝑃 𝒔! 𝒓!∗, 𝒔!!! .𝑃 𝒔!!! 𝒔! wasdeterminedasfollows.LetΔ𝒔betheEuclideandistanceinsfromonetimesteptothenext.Wefit𝑃(Δ𝒔)asaGaussianusingdatafromthetrainingset.𝑃 𝒔!!! 𝒔! wasapproximatedas𝑃(Δ𝒔!).Thatis,theprobabilityofgoingfromoneoutputstatetoanotherwasonlybasedonthedistancebetweentheoutputstates,nottheoutputstateitself.Additionally,including𝑃 𝒔 basedonthedistributionofoutputvariablesinthetrainingsetdidnotimprove performance on the validation set. This could be because the probability distributiondiffered between the training and validation/testing sets, or because the distribution of outputvariableswasapproximatelyuniforminourtasks.Thus,wesimplyusedauniformprior.Inourcalculations,wediscretizes intoa100x100gridgoingfromtheminimumtomaximumoftheoutputvariables.Whenincreasingthedecodingresolutionoftheoutputvariables,wedidnotseeameaningfulchangeindecodingaccuracy.OurtuningcurveshadtheformatofaPoissongeneralizedquadraticmodel[59],whichimprovedtheperformanceovergeneralizedlinearmodelsonvalidationdatasets.Onthehippocampusdataset,weusedthetotalnumberofspikesoverthesametimeintervalasweusedfortheotherdecoders(4binsbefore,theconcurrentbin,and5binsafter).Notethatusingasingle timebin of spikes led to very poor performance.On themotor cortex and somatosensorycortexdatasets,thenaïveBayesdecodergaveverypoorperformanceregardlessofthebinsused.Weultimatelyusedbinsthatgavethebestperformanceonavalidationset:2binsbeforeandtheconcurrentbinforthemotorcortexdataset;1binbefore,theconcurrentbin,and1binafterforthesomatosensorycortexdataset.

AdditionalDemonstrationResults

SupplementalFigure2:ExampleresultswithlimitedtrainingdataUsingonly2minutesoftrainingdataformotorcortexandsomatosensorycortex,and15minutesoftrainingdata for hippocampus, we trained two traditional methods (Wiener filter and Kalman filter), and twomodernmethods(feedforwardneuralnetworkandLSTM).Exampledecodingresultsareshownfrommotorcortex(left), somatosensorycortex(middle),andhippocampus(right), for thesemethods(top tobottom).Groundtruthtracesareinblack,whiledecoderresultsareinthesamecolorsaspreviousfigures.

VaryingbinsizesWhile all demonstrations in the main text only tested decoding accuracy using a set bin size,different decoding applications may require different temporal resolutions. It is therefore alsoimportanttocomparedecodingaccuracywithvaryingbinsizes.Forthemotorandsomatosensorycortexdatasets,we testedbinsizes from10-100ms.For thehippocampusdataset,we testedbinsizes from 30-400 ms. Surprisingly, for all methods besides the Kalman filter, decodingperformance remained consistent across all the tested bin sizes (Fig. S3). For the Kalman filter,decodingperformanceincreasedasbinsizesbecamelarger,likelybecausetheKalmanfilterusedasinglebinofneuraldatatomakepredictions(unliketheothermethodsthatusedmanybins),andsmallbinsizescouldthusleadtoinsufficientneuraldatatopredictbehaviorwell.Importantly,forall bin sizes, themodernmethodshadbetterdecodingaccuracy.Thus,modernmachine learningmethodsremainadvantageousregardlessofthetemporalresolution.

SupplementalFigure3:DecoderresultswithdifferentbinsizesUsing varying bin sizes, we trained two traditional methods (Wiener filter and Kalman filter), and twomodernmethods(feedforwardneuralnetworkandLSTM).WeusedthesametestingsetasinFig.5,andthelargest trainingset fromFig.5.R2valuesarereportedforthesedecoders(differentcolors) foreachbrainarea(toptobottom).Errorbarsare68%confidence intervals(meant toapproximate theSEM)producedviabootstrapping,asweusedasingletestset.

SupplementalFigure4.KalmanFilterVersionsR2valuesarereportedfordifferentversionsoftheKalmanFilterforeachbrainarea(toptobottom).Ontheleft(inbluishgray),theKalmanFilterisimplementedasin[24].Ontheright(incyan),theKalmanFilterisimplemented with an extra parameter that scales the noise matrix associated with the transition inkinematicstates(seeDemonstrationMethods).Thisversionwiththeextraparameteristheoneusedinthemaintext.Errorbarsrepresentthemean+/-SEMacrosscross-validationfolds.X’srepresenttheR2valuesofeachcross-validationfold.Notethedifferenty-axislimitsforthehippocampusdataset.

AdditionalDiscussionofMain-textDemonstrationsWe find it particularly interesting that theneural networkmethodsworked sowellwith limiteddata,whichiscountertothecommonperception.Webelievetheexplanationissimplythesizeofthe networks. For instance, our networks have on the order of 105 parameters, while commonnetworks for image classification (e.g., [60]) can have on the order of 108 parameters. Thus, thesmallersizeofournetworks(hundredsofhiddenunits)mayhaveallowedforexcellentpredictionwithlimiteddata[61].Moreover,thefactthatthetasksweusedhadalow-dimensionalstructure,and therefore the neural data was also likely low dimensional [62], might allow high decodingperformancewithlimiteddata.ItisalsointriguingthatthefeedforwardneuralnetworkdidalmostaswellastheLSTMandbetterthanthestandardRNN,consideringtherecentattentiontotreatingthebrainasadynamicalsystem[63]. For the motor and somatosensory cortex decoding, it is possible that the highly trainedmonkeysyieldedastereotypedtemporalrelationshipbetweenneuralactivityandmovementthatafeedforward neural network could effectively capture. It would be interesting to compare theperformanceoffeedforwardandrecurrentneuralnetworksonlessconstrainedbehavior.In order to find the best hyperparameters for the decoding algorithms, we used a Bayesianoptimizationroutine[33]tosearchthehyperparameterspace(seeDemonstrationMethods).Still,itispossiblethat,forsomeofthedecodingalgorithms,thehyperparameterswerenonoptimal,whichwould have lowered overall accuracy. Moreover, for several methods, we did not attempt tooptimizeallthehyperparameters.Wedidthisinordertosimplifytheuseofthemethods,decreasecomputational runtime during hyperparameter optimization, and because optimizing additionalhyperparameters(beyonddefaultvalues)didnotappeartoimproveaccuracy.Forexample,fortheneuralnetsweuseddropoutbutnotL1orL2regularization,and forXGBoostweoptimized lessthan half the available hyperparameters designed to avoid overfitting. While our preliminarytesting with additional hyperparameters did not appear to change the results significantly, it ispossiblethatsomemethodsdidnotachieveoptimalperformance.

Documents

Machine learning for neural decoding - arXiv · We will describe a number of specific methods suitable for neural decoding later in this tutorial. Ultimately, it is often beneficial