
General AI Challenge Round One: Gradual Learning

Jan Feyereisl, Matěj Nikl, Martin Poliak, Martin Stránský, Michal Vlasák
GoodAI, Prague, Czech Republic∗

Abstract

The General AI Challenge is an initiative to encourage the wider artificial intelligence community to focus on important problems in building intelligent machines with more general scope than is currently possible. The challenge comprises multiple rounds, with the first round focusing on gradual learning, i.e. the ability to re-use already learned knowledge to learn to solve subsequent problems more efficiently. In this article, we present details of the first round of the challenge, its inspiration and aims. We also outline a more formal description of the challenge and present a preliminary analysis of its curriculum, based on ideas from computational mechanics. We believe that such a formalism will allow for a more principled approach towards investigating tasks in the challenge, building new curricula, and potentially improving subsequent challenge rounds.

1 Introduction

Existing artificial intelligence (AI) algorithms available today constitute the so-called narrow AI landscape, meaning that they have been designed, trained, and optimized by human engineers to solve a single, specific task or a very narrow collection of closely related problems. Although such algorithms sometimes outperform humans in their established skill-set, they are not able to extend their capabilities to new domains. This limits their re-usability, potentially increases the amount of data required to train them, and leaves them lacking generality and extensibility to higher order reasoning.

In contrast, algorithms capable of overcoming these limitations could eventually converge towards a repertoire of functionalities akin to a human-level skill-set. Such algorithms might be able to learn to come up with creative solutions for a wide range of multi-domain tasks. Systems employing such algorithms, often grouped under the umbrella of general AI systems, are viewed by many as the ultimate leverage in solving many of humanity's direst problems [Bostrom, 2014; Domingos, 2015; Stone et al., 2016].

∗Correspondence: {jan.feyereisl, matej.nikl, martin.poliak, martin.stransky, martin.vlasak}@goodai.com

Figure 1: Overview of the first round of the General AI Challenge. The structure of the challenge is hierarchical and multi-objective, and requires learning at multiple levels to solve a curriculum C: from fast learning (A) at the individual instance level of each task, where a 'good enough' policy πi for a stochastic decision process (SDP) must be found, through shared sub-goal discovery (B) across instances τ1, . . . , τk leading to one-shot or few-shot learning, all the way to lower-level knowledge and policy transfer (C) across different tasks Ti for speeding up convergence towards solutions on yet unseen tasks. What the agent sees is shown at the bottom, i.e. continual streams of data with no instance or task separation information, where E denotes the environment stream, A the agent's response and R the reward.

Gradual learning is the ability of a learning system to learn in a gradual, incremental manner. This is likely a necessary pre-condition for acquiring such a broad set of skills, striving to solve a wide range of disparate problems [Rosa et al., 2016; Kumaran et al., 2016; Tommasino et al., 2016; Rusu et al., 2016; Balcan et al., 2015; Pentina and Lampert, 2015; Ruvolo and Eaton, 2013; Hamker, 2001]. Unfortunately, most existing algorithms struggle to deal with many tasks with different distributions at the same time, or with tasks that drift in distribution from one another [Ditzler et al., 2015], and they frequently exhibit catastrophic forgetting once applied to new problems, especially under realistic computational constraints [Rusu et al., 2016; Fernando et al., 2017; Hamker, 2001; Kirkpatrick et al., 2016; French, 1999; Gershman et al., 2015]. Gradual learning is therefore a difficult and unsolved problem that requires a focused investigation by the community. The General AI Challenge provides a platform for such directed focus.

2 Round One: Gradual Learning

The aim of the first round of the challenge is to build an agent that exhibits gradual learning [Rosa et al., 2016], as evidenced by solving a curriculum of increasingly complex and disparate sets of tasks. Despite the similarity in name to gradual learning in [Kumaran et al., 2016] and to other related concepts such as cumulative [Tommasino et al., 2016], incremental [Rusu et al., 2016], life-long [Balcan et al., 2015; Pentina and Lampert, 2015; Ruvolo and Eaton, 2013; Hamker, 2001] or continual learning [Kirkpatrick et al., 2016], in our setting the concept encompasses a broader class of requirements that an agent needs to satisfy. These include:

1. Exploiting previously learned knowledge

2. Fast adaptation to unseen problems

3. Avoiding catastrophic forgetting

The above requirements are expressed in two primary objectives of the first round, each with its own set of evaluation criteria:

Objective 1 (Quantitative) An agent capable of passing an evaluation curriculum in the shortest number of simulation steps. Passing requires the ability to exploit previously learned knowledge and to avoid catastrophic forgetting, all within a predefined time limit.

Alternatively, a description of a conceptual method for achieving gradual learning is also sought:

Objective 2 (Qualitative) An idea, concept, or design that shows the best promise for scalable gradual learning. A working AI agent is desirable but not necessary.

The fundamental focus of the challenge is on gradual learning, i.e. on the efficient re-use of already gained knowledge. Competitors are required to develop solutions with the following properties:

Property 1 (Gradual Learning) An ability to re-use previously gained knowledge for the acquisition of subsequent knowledge, in order to solve new and hitherto unseen problems more efficiently, while avoiding catastrophic forgetting.

It is important to note here not only the re-use of existing knowledge, but also the focus on the efficiency of solving new and unseen problems. Unlike compositionality or other concepts similar to gradual learning, our definition transcends the boundaries of meta-learning [Duan et al., 2017b; Wang et al., 2016; Duan et al., 2017a; Santoro et al., 2016; Chen et al., 2016b; Chen et al., 2016a]. We seek agents that are able to learn to quickly adapt to new and unseen tasks, while exploiting the gradual structure of the underlying problems they are solving. One can think of this as requiring agents to possess the ability to search for more efficient strategies for solving new problems.

Naturally, when solving new problems, agents must retain the ability to solve old tasks. In many systems, the lack of such an ability causes catastrophic forgetting [French, 1999], which must be avoided:

Property 2 (Avoiding Catastrophic Forgetting) Avoiding the loss of information relevant for one task due to the incorporation of knowledge necessary for a new task.

In summary, the aim is not to optimize the agent's performance on existing skills, i.e. how good an agent is at delivering solutions to problems it has already encountered. Instead, we want to optimize the agent's performance on solving new and unseen problems, i.e. to maximize the speed of convergence to 'acceptable solutions' on new problems, while exploiting existing knowledge and ensuring its survival during the acquisition of new information.

3 Background

To fully appreciate our setting, we provide background below on why graduality can be beneficial and how it can be encouraged. This is followed by a more formal description of the challenge requirements, environment and evaluation procedures.

3.1 Benefits of Graduality

Given a complex task that needs to be solved, a good strategy for finding a solution is frequently to break the problem down into smaller problems which are easier to deal with. The same applies to learning [Bengio et al., 2009; Salakhutdinov et al., 2013]. It can be much faster to learn things gradually than to try to learn a complex skill from scratch [Alexander and Brown, 2015; Zaremba and Sutskever, 2015; Gulcehre et al., 2016; Oquab et al., 2014]. One example of this is the hierarchical decomposition of a task into subtasks and the gradual learning of the skills necessary for solving each of them [Krueger and Dayan, 2009; Vezhnevets et al., 2017; Lee et al., 2017; Andreas et al., 2017], progressing from the bottom of the hierarchy to the top [Mhaskar et al., 2016; Mhaskar and Poggio, 2016; Poggio et al., 2015; Polya, 2004]. A prime example in the natural world is the gradual acquisition of motor skills by infants during their first years of life [Adolph and Franchak, 2017].

3.2 Guidance through Curricula

Exploiting graduality during learning is clearly beneficial. Building systems that learn in a gradual manner should therefore be encouraged, and its benefits and limitations explored further. To enable this type of learning, one can control and guide the learning process. Guided learning, also called curriculum learning [Gulcehre and Bengio, 2016; Bengio et al., 2009; Vapnik, 2015], provides control by presenting the learner with parts of the problem in the order and to the extent that is likely to be most beneficial at that point in time. One method of providing such an order is in the form of a learning curriculum [Bengio et al., 2009], akin to a curriculum used in schools. In this scenario, easier topics are taught before more complex ones, in order to exploit the gradual nature of the taught knowledge.

Gradual and guided learning has a number of other benefits over other types of learning [Gulcehre et al., 2016; Gulcehre and Bengio, 2016; Bengio et al., 2009; Pan and Yang, 2010]. For example, optimizing a model that has few parameters and gradually building up to a model with many parameters could be more efficient than starting with a model that has many parameters from the beginning [Stanley and Miikkulainen, 2002]. In this case, a smaller number of new parameters is learned at each step [Chen et al., 2015; Kirkpatrick et al., 2016; Andreas et al., 2016b; Rusu et al., 2016]. This might also reduce the need for exploration. Furthermore, a priori knowledge of the system's architecture might not be necessary [Zhou et al., 2012; Rusu et al., 2016; Ganegedara et al., 2016], and the architecture size can be dynamically derived during training to correspond to the complexity of the given problems [Fahlman and Lebiere, 1990]. Last but not least, the reuse of already learned skills is feasible and encouraged [Andreas et al., 2016b; Andreas et al., 2016a]. Once a skill is acquired, it is no longer relevant how long the skill took to discover. The cost of using an existing skill is notably smaller than that of searching for a skill from scratch.

4 Learning to Gradually Learn

Having described the objectives of the challenge and established the benefits of gradual learning through a curriculum, we will now focus on formal definitions and descriptions of the requirements of the first round of the challenge.

4.1 Instances, Tasks & Curricula

The gradual learning ability of an agent is evaluated by subjecting it to a sequence of n tasks T1, T2, . . . , Tn of increasing complexity. Such a sequence is called a curriculum:

Definition 1 (Curriculum) A curriculum C = (T1, . . . , Tn) is an n-tuple of tasks of increasing complexity. Tasks are endowed with an arbitrary measure of complexity.

In section 8.3, we present one possible way of measuring task complexity in a principled manner and of ensuring a proper ordering of curricula. This can be done with the help of a measure from the field of computational mechanics, namely the statistical complexity Cµ [Crutchfield, 1994].

Each task is observed by an agent through task instances τ ∼ T. The publicly available curriculum provided as part of the challenge [Poliak et al., 2017] can be seen as a curriculum of distributions over stochastic decision problems (SDPs), in particular over a variant of partially observable Markov decision processes (POMDPs), or more generally partially observable stochastic games (POSGs).

Definition 2 (Task and Instance) A task T is a distribution over a set of Partially Observable Markov Decision Processes. An instance of a task τ ∼ T is a sample from said distribution.

In other words, a single instance τ of a task T is a POMDP. A POMDP is an 8-tuple (S, A, Ω, P, O, r, γ, T) where S is a set of states, A a set of actions, Ω a set of observations, P : S × A × S → [0, 1] a transition probability distribution, O : S × A × Ω → [0, 1] an observation probability distribution, r : S × A → R a reward function, γ a discount factor and T a horizon. In the standard POMDP formulation, the goal of an agent is to maximize its expected future discounted reward E_σ[∑_{t=0}^{T} γ^t r(s_t, a_t)], where the expectation is over the sequence of the agent's belief state/action pairs σ = ((b_0, a_0), . . . , (b_T, a_T)), where b_0(s) = P(S = s) is an initial belief state, a_t ∼ π_θ(a_t | b_t), b_{t+1} = P(b_{t+1} | b_t, a_t, o_t), and π_θ denotes a policy parameterized by θ.
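To make the objects above concrete, the following minimal Python sketch (hypothetical names, not part of the challenge codebase) bundles the POMDP 8-tuple and computes the discounted return of the standard formulation.

# Minimal sketch (hypothetical names, not the challenge codebase): the POMDP
# 8-tuple from Definition 2 and the discounted return of the standard formulation.
from typing import Callable, NamedTuple, Sequence

class POMDP(NamedTuple):
    states: Sequence          # S
    actions: Sequence         # A
    observations: Sequence    # Omega
    transition: Callable      # P : S x A x S -> [0, 1]
    observation_fn: Callable  # O : S x A x Omega -> [0, 1]
    reward: Callable          # r : S x A -> R
    gamma: float              # discount factor
    horizon: int              # T

def discounted_return(rewards, gamma):
    # sum over t of gamma^t * r_t for one trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1, 1], 0.9))  # 4.0951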

Unlike in the standard setting, however, the goal in the challenge round is different and distributed across multiple levels of hierarchy. We are not interested in maximizing the agent's future discounted reward, but rather in a more complex set of objectives at different scales. For example, at the instance level it is sufficient to find an acceptable solution to each instance τi, as described in Algorithm 1. Figure 1 shows the three levels of hierarchy present in the challenge:

1. Fast learning (A): Policy Search − At the individual instance level, the agent is required to find a solution (e.g. a policy π) to a single POMDP/task instance τ.

2. Slow learning (B): Meta-Policy Discovery − The agent needs to discover a meta-strategy (e.g. a meta-policy πmeta) that quickly converges to a solution across all instances τi.

3. Fast Adaptation (C): Policy Transfer − The agent is required to exploit existing acquired knowledge and policies in order to adapt to each new task in a curriculum, to solve it faster.

Whether it is possible to find a correspondence between the challenge objectives and the standard reward formulation remains to be seen, and is up to competitors to determine.

In addition to the above hierarchy, a number of objectives and constraints need to be satisfied by competing agents. The primary goal of an agent is the completion of the entire curriculum C in as short a time as possible.

Definition 3 (Quantitative Objective) Successful completion of an evaluation curriculum Ce in the shortest number of time-steps possible among all competitors and within 24 hours from the start of the evaluation process on predefined evaluation H/W. Given a set of competing agents A, the fastest agent is determined according to:

    arg min_{α ∈ A} ϱ(α, Ce)

where ϱ : A × C → N+ corresponds to RunCurriculum() in Algorithm 2, which returns the number of simulation steps it took an agent α to successfully complete a curriculum C.

In addition to satisfying the condition set forth in Definition 3, the agent also has to show that two additional conditions are met, namely gradual learning and avoiding catastrophic forgetting.

Definition 4 (Gradual Learning) A manifestation of Property 1, exhibited through a reduced number of computational steps required when solving a task T after having solved previous tasks, i.e. ϱ(α(T1,T2), (T3)) < ϱ(α, (T3)), where α(T1,T2) denotes an agent α that has already learned to solve tasks T1 and T2.

The above condition does not ensure that gradual learning truly occurs; nevertheless, it provides sufficient evidence that an agent improves its performance on subsequent tasks, having already solved other tasks before.

Definition 5 (Avoiding Catastrophic Forgetting) A manifestation of Property 2 through the maintenance of fast convergence to an acceptable solution for already solved tasks, i.e. ϱ(α(T1,T2,T3), (T2)) ≤ c·ϱ(α(T1,T2), (T2)), where c is some constant of the agent.

This condition ensures that learning to solve a new task does not impair the ability to solve previously encountered tasks.
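As an illustration only, the checks behind Definitions 4 and 5 could be expressed as follows, where rho is a hypothetical stand-in for the step-count function ϱ(·,·) and the agent arguments represent α with different training histories:

# Illustrative sketch of the conditions in Definitions 4 and 5
# (rho and the agent snapshots are hypothetical placeholders).
def shows_gradual_learning(rho, pretrained_agent, fresh_agent, new_task):
    # Definition 4: having solved earlier tasks should reduce the number of
    # steps needed on a new task compared to learning it from scratch.
    return rho(pretrained_agent, [new_task]) < rho(fresh_agent, [new_task])

def avoids_catastrophic_forgetting(rho, agent_after, agent_before, old_task, c=2):
    # Definition 5: after learning further tasks, an old task should still be
    # solved within a constant factor c of the earlier step count.
    return rho(agent_after, [old_task]) <= c * rho(agent_before, [old_task])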

4.2 Validation Curricula

To test generalization of gradual learning, an agent trained on a single curriculum C cannot be effectively evaluated on tasks from the same curriculum. A different curriculum is necessary. Curricula appropriate for a gradually learning agent also form a distribution C, from which a training and an evaluation curriculum should be drawn. The mini- and micro-tasks used in the challenge (described in section 6) together form one sample from this distribution. Using a single sample curriculum C ∼ C to estimate the distribution C might be too difficult. It is possible that competitors might need to create their own curricula. Examples of a number of possible final tasks from alternative curricula were shown in [Rosa, 2017].

5 Evaluation

As mentioned previously, simply maximizing reward R is neither sufficient nor desired. Immediately after the agent reaches an acceptable performance R∗ on an instance, the environment presents it with the next sample τ. Upon the successful completion of a number of instances from the same task, determined according to Algorithm 1, the environment presents the agent with the next task in the curriculum. This can be seen in Algorithm 2.
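The progression just described could be driven by a loop like the following sketch; task.sample() and the solve callable are hypothetical placeholders mirroring Algorithm 2, not the actual challenge API.

# Hypothetical sketch of the curriculum loop from Algorithm 2
# (placeholder names; not the official challenge code).
def run_curriculum(agent, curriculum, required_successes, solve):
    # solve(agent, instance) -> (steps_taken, solved_within_soft_limit)
    total_steps = 0
    for task in curriculum:                               # tasks in curriculum order
        consecutive = 0
        while consecutive < required_successes:
            instance = task.sample()                      # tau ~ T
            steps, solved_in_time = solve(agent, instance)
            total_steps += steps
            consecutive = consecutive + 1 if solved_in_time else 0
    return total_steps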

The EnvStep and AgentStep functions in Algorithm 1 update the environment and the agent respectively, while exchanging reward, input and output. Together, they form the core of the environment-agent communication loop, depicted in Figure 2.

The helper functions SoftLimit and HardLimit are used to compute the respective time-step limits for a given instance. As the naming suggests, there are two types of limits considered when running an instance: a soft limit and a hard limit. Both limits are expressed as a number of environment steps. They are dynamically computed, as they depend on the current instance − in fact, they can even evolve as the agent progresses through the instance.

The soft limit can be thought of as a success limit. The agent is required to solve the instance within this limit in order for the attempt to count as successful.

Algorithm 1: Progression through one task instance τ

input : An agent α, a task instance τi, and the minimum number of positive rewards R∗, uninterrupted by a negative one, required to solve the current instance
output: The number of steps t taken to solve τi

function Solve(α, τi, R∗)
    t ← 0;                               # Counter of elapsed steps
    R+ ← 0;                              # Counter of positive rewards
    R, ot, τi ← EnvStep(τi, ∅);          # Environment
    # Agent-environment loop until reward/time-limit reached
    repeat
        α, at ← AgentStep(α, R, ot);     # Agent
        R, ot, τi ← EnvStep(τi, at);     # Environment
        t ← t + 1;                       # Increment elapsed steps
        # If correct, increment reward counter, else reset it
        if R = 1 then
            R+ ← R+ + 1;                 # Increment reward counter
        else if R = −1 then
            R+ ← 0;                      # Reset positive reward counter
        end
    until R+ = R∗ or t = HardLimit(τi);
    return t;                            # No. of steps taken to solve instance τi
end

If the agent fails to do so, the prospective solution will no longer be considered successful. The instance, however, does not end yet. This allows the agent to continue exploring and to gather additional useful knowledge about the unsolved problem.

A forceful instance termination comes only when the hard limit is reached. This behavior is beneficial in situations where an agent gets stuck in a particular instance.

The reasoning behind such limits is as follows: the agent has no way of communicating its needs, for example the need to change the instance to a different one, or to switch to the previous task. Therefore, these limits can be thought of as a supplement covering such cases.
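The following minimal Python sketch mirrors Algorithm 1 together with the limits just described; env, agent and the limit values are hypothetical placeholders rather than the actual challenge API.

# Hypothetical sketch of running a single instance with soft and hard limits
# (placeholder names; not the official challenge interface).
def solve_instance(agent, env, required_reward, soft_limit, hard_limit):
    steps, consecutive_positive = 0, 0
    reward, observation = 0, env.reset()
    while True:
        action = agent.step(reward, observation)   # agent reacts to last reward/observation
        reward, observation = env.step(action)     # environment reacts to the action
        steps += 1
        if reward == 1:
            consecutive_positive += 1              # count uninterrupted positive rewards
        elif reward == -1:
            consecutive_positive = 0               # a negative reward resets the counter
        if consecutive_positive == required_reward or steps == hard_limit:
            break
    # Only attempts finished within the soft limit count as successful.
    solved_in_time = consecutive_positive == required_reward and steps <= soft_limit
    return steps, solved_in_time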

6 Learning Environment

Participants are provided with an environment and a public curriculum of tasks of increasing complexity [Poliak et al., 2017]. Both the environment and the associated tasks are specifically constructed to minimize distraction from unrelated research problems and to focus primarily on gradual learning and its associated obstacles. The environment and tasks are built on top of the CommAI-env [Mikolov et al., 2015; Baroni et al., 2017].

Participants are asked to develop and train their agents on the public curriculum, and possibly on any other additional knowledge or data sources that they have at their disposal. The public set of tasks provides a reference example of how the evaluation curriculum might be structured and of the type of tasks that an agent will be required to solve. Tasks come in two groups: simple micro-tasks and more advanced mini-tasks. It is important to note that natural language processing is not necessary to solve the tasks.


Algorithm 2: Progression through a curriculum C

input : An agent α, a curriculum C, and the number of successfully solved consecutive instances Ns required
output: The total number of steps T the agent needed to solve curriculum C

function RunCurriculum(α, C, Ns)
    T ← 0;                               # Counter of elapsed steps
    # Solve each task T in a curriculum C
    foreach T in C do
        Nj ← 0;                          # Counter of consecutive successes
        Rj ← ReqReward(T);               # Required min. reward
        # Successfully solve Ns instances of task T
        repeat
            τi ∼ T;                      # Sample a new instance
            ti ← Solve(α, τi, Rj);       # Solve instance
            T ← T + ti;                  # Increment elapsed steps
            ts ← SoftLimit(τi);          # Calc. soft limit
            # If solved sufficiently quickly
            if ti ≤ ts then
                Nj ← Nj + 1;             # increment success counter
            else
                Nj ← 0;                  # else reset counter
            end
        until Nj = Ns;
    end
    return T;                            # No. of steps taken to solve C
end

6.1 Mini-tasks

Mini-tasks are based directly on the CommAI-mini task set [Baroni et al., 2017], with a focus on simple grammar processing and on deciding language membership problems for strings. The mini-tasks are simple for an educated and biased human, but they are tremendously complex for an unbiased agent that receives only sparse reward. For this reason, we believe that simpler tasks are necessary. We refer to these as micro-tasks.

6.2 Micro-tasks

The purpose of micro-tasks is to allow the agent to acquire a sufficient repertoire of prior knowledge, necessary for discovering the solutions to the more complex mini-tasks in a reasonable time, under the assumption that the agent is capable of gradual learning.

Decomposing mini-tasks into simpler problems yields a number of skills that the agent should learn first. For each such skill, there is a separate micro-task. The micro-tasks build on top of each other in a gradual and compositional way, relying on the assumption that the agent can make use of them efficiently.

One could argue that the micro-tasks (and even the mini-tasks) are too simple and can be solved by hard-coding the necessary knowledge into the agent. This is indeed true. However, such a solution is not desirable and goes against the spirit of the challenge − to learn new skills from data.

Figure 2: a) Interface between the agent and the environment. b) Depiction of environment output with overlaid simulation order.

The challenge will prevent any hard-coded solution by using hidden evaluation tasks which are different from the public curriculum. A hard-coded solution that does not exhibit gradual learning should fail on such evaluation tasks.

6.3 Environment-Agent Interface

The interface between the agent and the environment is depicted in Figure 2. The environment sends a reward (−1, 0, or 1) and data (one byte) to the agent and receives the agent's action (one byte) in response. This happens in a continuous cycle. During a single simulation step, the environment processes the action received from the agent and sends a reward together with new data to the agent; the agent processes this input and sends an action back to the environment.

Specifically, the environment sends 8 bits of data to the agent at every simulation step and receives 8 bits of data back from the agent. This differs from the original CommAI-env [Mikolov et al., 2015], which sends only a single bit instead of a full byte each time. Although this makes the interface slightly less flexible, the agent's communication should be more interpretable and it should reduce the complexity of the presented problems.

Note that the environment does not send any special information about which task the agent is in at any point in time. To the agent, the environment appears as a continuous stream of bytes (and rewards).
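For illustration, an agent conforming to this byte-in/byte-out loop might be structured as in the sketch below; the class and method names are illustrative, not the challenge's actual interface.

# Illustrative agent stub for the per-step exchange described above
# (placeholder names; not the official interface).
import random

class RandomByteAgent:
    # Receives a reward in {-1, 0, 1} and one observed byte per simulation step,
    # and answers with one byte of its own.
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, reward: int, observation: int) -> int:
        # A real competitor would update its internal model from
        # (reward, observation) here before answering.
        return self.rng.randrange(256)  # reply with one byte (0-255)

agent = RandomByteAgent()
action = agent.step(reward=0, observation=ord('h'))  # one simulation step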

7 Discussion

Looking back at Figure 1, it is apparent that there is a significant gap between the structure of the curriculum, the objectives to be learned within it, and the kind of information the agent receives. Most notably, there is no explicit information about the successful completion of an instance or a task. Similarly, no indication of an instance or task switch is provided. The lack of information and the limited amount of feedback that an agent has at its disposal make this round challenging and different from most other learning problems.

Moreover, unlike standard POMDPs, for many of the tasks in the public curriculum the reward is adaptive. This can be very challenging due to the drifting nature of the reward function. Another challenging aspect of the first round is the inability to request previous tasks from the environment. Hence, competing agents need to be sample efficient as well as 'epoch' efficient.

8 Task Complexity via Computational Mechanics

Computational mechanics [Crutchfield, 2011], a subfield of physics, is concerned with exploring the ways in which nature performs computation: how to identify structure in natural processes, how to measure it, and how information processing is embedded in dynamical behavior. The field offers a number of useful tools for thinking about and quantifying complex systems, some of which are relevant for building intelligent machines. Namely, we are interested in the concept of ε-machines [Crutchfield and Young, 1989] and the associated measure of statistical complexity Cµ.

In our scenario, these concepts and tools can be useful for exploring the challenge curricula, both from the point of view of the complexity of each individual task and as a first step towards automating the creation of more principled curricula for training and evaluation.

8.1 ε-Machines as Minimal Optimal Models

The concept of ε-machines as a class of minimal optimal predictors was developed for quantifying structure in a stochastic process via the reconstruction of the underlying causal states of a natural process [Crutchfield, 1994]. In their most common form, they can be thought of as a type of hidden Markov model (HMM) whose states have a unique, causal definition. ε-machines are, however, not limited to an HMM representation and can in fact be used at various levels of a representation hierarchy, depending on whether the current representation hits the limits of the agent's computational resources [Crutchfield, 1994].

An ε-machine can be constructed with the help of an equivalence relation

    ←y ∼ε ←y′ ⟺ P(→Y | ←Y = ←y) = P(→Y | ←Y = ←y′)    (1)

where Y denotes a discrete random variable, ←Y and →Y a block of past (e.g. ←Y_t = . . . Y_{t−2} Y_{t−1}) and future random variables, respectively, with ←y and →y denoting a particular history or future (a sequence of symbols from the alphabet Y) of a generating process, respectively. The equivalence relation (1) states that any two histories ←y and ←y′ are equivalent if the probability distribution over their futures is the same, i.e. P(→Y | ←Y = ←y) = P(→Y | ←Y = ←y′). Given the above equivalence relation, one can then define a mapping from the space of histories ←Y to the 'causal' states of the underlying process. Such a mapping ε : ←Y → S is called an ε-map; it takes a given past ←y to its corresponding causal state σi:

    ε(←y) = σi = {←y′ : ←y ∼ε ←y′}

Intuitively, one can think of ε as a function that partitions the set of pasts ←Y, according to the equivalence relation (1), into clusters that lead to the same distribution over futures.

We can then define the ε-machine as a tuple (Y, S, T), where Y denotes the process' alphabet, S its causal state set, and T a set of symbol-valued transition matrices:

    T ≡ {T^(y) : y ∈ Y}

where T^(y) comprises the following elements:

    T^(y)_ij = P(S_1 = σj, Y_0 = y | S_0 = σi),

where S_0 and S_1 denote two temporally consecutive random variables of the random variable chain defining the causal state process that underlies the stochastic process we are modeling. Each S_t takes on some value s_t = σi ∈ S. The set T thus defines the dynamics over the causal states of the underlying system. An ε-machine is then a unique, minimal-size, maximally predictive, unifilar representation of the observed process [Crutchfield and Young, 1989].
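As a toy illustration of the equivalence relation (1) (purely illustrative, not a full ε-machine reconstruction algorithm), the sketch below groups fixed-length histories of a sample sequence by their empirical distribution over the next symbol, which is the intuition behind causal states:

# Toy illustration of grouping histories by their empirical next-symbol
# distribution, the intuition behind causal states (not a reconstruction algorithm).
from collections import Counter, defaultdict

def causal_state_partition(sequence, history_len=2):
    # Count next-symbol occurrences for every observed history of length history_len.
    futures = defaultdict(Counter)
    for t in range(history_len, len(sequence)):
        history = tuple(sequence[t - history_len:t])
        futures[history][sequence[t]] += 1
    # Histories with the same (empirical) distribution over futures form one cluster.
    clusters = defaultdict(list)
    for history, counts in futures.items():
        total = sum(counts.values())
        dist = tuple(sorted((sym, round(n / total, 3)) for sym, n in counts.items()))
        clusters[dist].append(history)
    return list(clusters.values())

# Example: a period-2 process 'ababab...' yields two clusters (two causal states).
print(causal_state_partition("abababababab"))  # [[('a', 'b')], [('b', 'a')]]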

Despite their desirable minimality and optimality properties, ε-machines are inherently able to model generative processes only. In order to exploit the useful properties that the ε-machine formulation offers for the analysis of our curriculum, we need to go beyond output processes and use a slightly more complex representation, that of input-output processes, namely ε-transducers.

8.2 ε-Transducer

In our scenario, we are interested in observing the structure of not only one process, but of a coupling between two stochastic processes, which can also be viewed as the analysis of a communication channel between an agent and its environment. The concept of an equivalence relation can be extended, for this type of scenario, over joint (input-output) pasts:

    ←(x, y) ∼ε ←(x, y)′ ⟺ P(→Y | →X, ←(X, Y) = ←(x, y)) = P(→Y | →X, ←(X, Y) = ←(x, y)′)    (2)

defining the basis for a channel's unique, maximally predictive, minimal statistical complexity, unifilar presentation, called the ε-transducer. Similarly to ε-machines, a map from pasts to states can be created, ε : ←(X, Y) → S, that maps a given joint input-output past ←(x, y) to its corresponding channel causal state

    ε(←(x, y)) = σi = {←(x, y)′ : ←(x, y) ∼ε ←(x, y)′}.

The ε-transducer is then defined as the tuple (X, Y, S, T), where, in contrast to ε-machines, X additionally defines the set of inputs and T the set of conditional transition probabilities:

    T ≡ {T^(y|x) : x ∈ X, y ∈ Y}

where T^(y|x) has elements:

    T^(y|x)_ij = P(S_1 = σj, Y_0 = y | S_0 = σi, X_0 = x)

The ability of ε-transducers to model input-output mappings enables the incorporation of actions and makes modeling of the tasks in a curriculum C feasible.


8.3 Statistical and Structural Complexity

Having defined both ε-machines and ε-transducers, we can now define a distribution π over causal states for ε-machines as

    π(i) = P(S_0 = σi) = P(ε(←Y) = σi)

and an input-dependent state distribution π_X for ε-transducers:

    π_X(i) = P_X(S_0 = σi) = P_X(ε(←(X, Y)) = σi)

In both cases, this defines the asymptotic probability of a process being in any one of its causal states. This information can then be used for quantifying the complexity of the underlying generating or input-output process, respectively. This measure, also called statistical complexity in the case of ε-machines, is defined as Cµ = H[π], where µ indicates that it is a measure over an asymptotic distribution, and H denotes the Shannon entropy [Shannon and Weaver, 1948]. On the other hand, the ε-transducer's input-dependent statistical complexity is defined as

    C_X = H[π_X].

The upper bound on ε-transducer complexity is then the channel complexity, calculated as the supremum of statistical complexity over input processes:

    Cµ = sup_X C_X

with the topological state complexity being C_0 = log2 |S| and, because uniform distributions maximize Shannon entropy, in general Cµ ≤ C_0 [Barnett and Crutchfield, 2015].

Statistical complexity Cµ is a widely used measure of complexity in physics and complexity science. It has a number of benefits over other measures and intuitive interpretations. For example, it measures the amount of information a process needs to store in order to be able to reconstruct a given signal, or, in the case of structural complexity, it defines the capacity of a communication channel. In the case of the tasks in our challenge, it could provide a principled way of measuring an upper bound on task complexity that can be beneficial in the construction of structured curricula and the subsequent automatic generation of tasks of increasing complexity.
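As a small numerical illustration of Cµ = H[π] (a sketch only, using a made-up state distribution rather than one derived from an actual challenge task):

# Sketch: statistical complexity as the Shannon entropy of the asymptotic
# causal-state distribution pi (the example distribution is made up).
import math

def shannon_entropy(distribution):
    # H[pi] in bits; zero-probability states contribute nothing.
    return -sum(p * math.log2(p) for p in distribution if p > 0)

pi = [0.5, 0.25, 0.25]        # asymptotic probabilities of three causal states
C_mu = shannon_entropy(pi)     # statistical complexity C_mu = H[pi]
print(C_mu)                    # 1.5 bits

C_0 = math.log2(len(pi))       # topological upper bound: C_0 = log2(|S|) >= C_mu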

9 Agent-agnostic Task and Curriculum Representation

Modeling the tasks T of the challenge curriculum C with the help of ε-transducers allows for an agent-agnostic representation of the complexity of the problem each task tackles. It not only provides a way to measure the complexity of each task, but also to compare and contrast entire task curricula and to reason about the inherent compositionality and graduality of an entire curriculum (see Figure 3b). Preliminary results for a number of tasks from the challenge curriculum can be seen in the appendix.

9.1 Task and Curriculum Complexity

Using statistical and structural complexity, we are able to start reasoning about the correctness of the curriculum order. This allows for better-defined curricula and, subsequently, the possibility of automating the curriculum-generation process with guaranteed characteristics.

9.2 Task and Curriculum Graduality

The minimal representation of tasks also allows us to start thinking about comparing how much structure is shared among tasks within a curriculum. This can be beneficial in designing and measuring the intrinsic graduality of a curriculum. As ε-machines and ε-transducers are inherently graphical models, graph-theoretic tools and algorithms, such as subgraph matching, can be used to start comparing tasks and to measure their level of shared structure, which can eventually be exploited by an agent with gradual learning capabilities.
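As an illustration of such a comparison, the sketch below uses networkx's subgraph isomorphism matcher on two toy causal-state graphs; the graphs are hypothetical placeholders, not ε-transducers of actual curriculum tasks.

# Hypothetical sketch: comparing shared structure of two task ε-transducers
# represented as directed graphs (requires networkx).
import networkx as nx
from networkx.algorithms import isomorphism

# Toy causal-state graphs for two tasks (placeholders, not real challenge tasks).
task_a = nx.DiGraph([("A0", "A1"), ("A1", "A2"), ("A2", "A0")])
task_b = nx.DiGraph([("A0", "A1"), ("A1", "A2"), ("A2", "A0"),
                     ("A2", "B0"), ("B0", "A0")])

# Does the simpler task appear as a substructure of the more complex one?
matcher = isomorphism.DiGraphMatcher(task_b, task_a)
print(matcher.subgraph_is_isomorphic())  # True if task_a's structure is embedded in task_b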

10 Conclusions

In this work, we have outlined the structure and goals of the first round of the General AI Challenge, focused on building agents that can learn gradually. We have argued that building such agents requires solving problems at three different levels of hierarchy: from fast learning, akin to a search for a policy at the level of a single task instance, through slow learning, which requires the discovery of a meta-policy that generalizes across instances, all the way to fast adaptation to new tasks, akin to policy transfer between disparate problems. Furthermore, we have proposed the use of a method from computational mechanics, namely ε-transducers, for modeling tasks in a curriculum, allowing for the quantification of task complexity and potentially of the graduality of entire curricula. Explicit calculation of the complexity of tasks and its subsequent use for building better curricula is left for future work.

Acknowledgements

We would like to thank the entire GoodAI team, in particular Olga Afanasjeva and Marek Rosa.

References

[Adolph and Franchak, 2017] Karen E. Adolph and John M. Franchak. The development of motor behavior. Wiley Interdisciplinary Reviews: Cognitive Science, 8(1-2), 2017.

[Alexander and Brown, 2015] William H. Alexander and Joshua W. Brown. Hierarchical Error Representation: A Computational Model of Anterior Cingulate and Dorsolateral Prefrontal Cortex. Neural Computation, 27(11):2354–2410, 2015.

[Andreas et al., 2016a] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to Compose Neural Networks for Question Answering. arXiv:1601.01705, 2016.

[Andreas et al., 2016b] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. In CVPR, pages 39–48, 2016.

[Andreas et al., 2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular Multitask Reinforcement Learning with Policy Sketches. In ICLR, 2017.

[Balcan et al., 2015] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient Representations for Lifelong Learning and Autoencoding. In COLT, pages 1–20, 2015.

[Barnett and Crutchfield, 2015] Nix Barnett and James P. Crutchfield. Computational Mechanics of Input-Output Processes: Structured Transformations and the ε-Transducer. Journal of Statistical Physics, 161(2):404–451, 2015.

[Baroni et al., 2017] Marco Baroni, Armand Joulin, Allan Jabri, German Kruszewski, Angeliki Lazaridou, Klemen Simonic, and Tomas Mikolov. CommAI: Evaluating the first steps towards a useful general AI. arXiv:1701.08954, 2017.

[Bengio et al., 2009] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48. ACM Press, 2009.

[Bostrom, 2014] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.

[Chen et al., 2015] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2Net: Accelerating Learning via Knowledge Transfer. arXiv:1511.05641, pages 1–10, 2015.

[Chen et al., 2016a] Yutian Chen, Matthew W. Hoffman, Sergio G. Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. arXiv:1611.03824, 2016.

[Chen et al., 2016b] Yutian Chen, Matthew W. Hoffman, Sergio G. Colmenarejo, Misha Denil, Timothy P. Lillicrap, and Nando de Freitas. Learning to Learn for Global Optimization of Black Box Functions. In NIPS, 2016.

[Crutchfield and Young, 1989] James P. Crutchfield and Karl Young. Inferring statistical complexity. Physical Review Letters, 63(2):105–108, 1989.

[Crutchfield, 1994] James P. Crutchfield. The calculi of emergence: computation, dynamics and induction. Physica D: Nonlinear Phenomena, 75(1-3):11–54, August 1994.

[Crutchfield, 2011] James P. Crutchfield. Between order and chaos. Nature Physics, 8:17–24, 2011.

[Ditzler et al., 2015] Gregory Ditzler, Manuel Roveri, Cesare Alippi, and Robi Polikar. Learning in Nonstationary Environments: A Survey. IEEE Computational Intelligence Magazine, 10(4):12–25, November 2015.

[Domingos, 2015] Pedro Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. 2015.

[Duan et al., 2017a] Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-Shot Imitation Learning. arXiv:1703.07326, 2017.

[Duan et al., 2017b] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. In ICLR, pages 1–14, 2017.

[Fahlman and Lebiere, 1990] Scott E. Fahlman and Christian Lebiere. The Cascade-Correlation Learning Architecture. In NIPS, pages 524–532, 1990.

[Fernando et al., 2017] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution Channels Gradient Descent in Super Neural Networks. arXiv:1701.08734, 2017.

[French, 1999] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[Ganegedara et al., 2016] Thushan Ganegedara, Lionel Ott, and Fabio Ramos. Online Adaptation of Deep Architectures with Reinforcement Learning. arXiv:1608.02292, 2016.

[Gershman et al., 2015] Samuel J. Gershman, Eric J. Horvitz, and Joshua B. Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015.

[Gulcehre and Bengio, 2016] Caglar Gulcehre and Yoshua Bengio. Knowledge Matters: Importance of Prior Information for Optimization. JMLR, 17(8):1–32, 2016.

[Gulcehre et al., 2016] Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying Networks. arXiv:1608.04980, pages 1–11, 2016.

[Hamker, 2001] Fred H. Hamker. Life-long learning Cell Structures − Continuously learning without catastrophic interference. Neural Networks, 14(4-5):551–573, 2001.

[Kirkpatrick et al., 2016] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. arXiv:1612.00796, 2016.

[Krueger and Dayan, 2009] Kai A. Krueger and Peter Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394, 2009.

[Kumaran et al., 2016] Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What Learning Systems do Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.

[Lee et al., 2017] Sungtae Lee, Sang-Woo Lee, Jinyoung Choi, Dong-Hyun Kwak, and Byoung-Tak Zhang. Micro-Objective Learning: Accelerating Deep Reinforcement Learning through the Discovery of Continuous Subgoals. arXiv:1703.03933, 2017.

[Mhaskar and Poggio, 2016] Hrushikesh Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. arXiv:1608.03287, pages 1–16, 2016.

[Mhaskar et al., 2016] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning Functions: When Is Deep Better Than Shallow. arXiv:1603.00988, pages 1–12, 2016.

[Mikolov et al., 2015] Tomas Mikolov, Armand Joulin, and Marco Baroni. A Roadmap towards Machine Intelligence. arXiv:1511.08130, pages 1–39, 2015.

[Oquab et al., 2014] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.

[Pan and Yang, 2010] Sinno J. Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010.

[Pentina and Lampert, 2015] Anastasia Pentina and Christoph H. Lampert. Lifelong Learning with Non-i.i.d. Tasks. In NIPS, pages 1–9, 2015.

[Poggio et al., 2015] Tomaso Poggio, Fabio Anselmi, and Lorenzo Rosasco. I-theory on depth vs width: hierarchical function composition. CBMM Memos, 2015.

[Poliak et al., 2017] Martin Poliak, Martin Stransky, Michal Vlasak, Jan Sinkora, and Premek Paska. General AI Challenge − Round 1. https://github.com/general-ai-challenge, 2017.

[Polya, 2004] G. Polya. How to solve it: a new aspect of mathematical method. Princeton University Press, 2004.

[Rosa et al., 2016] Marek Rosa, Jan Feyereisl, and The GoodAI Collective. A Framework for Searching for General Artificial Intelligence. arXiv:1611.00685, 2016.

[Rosa, 2017] Marek Rosa. General AI Challenge Specifications of the First (Warm-Up) Round: Gradual Learning − Learning Like a Human. GoodAI, 2017.

[Rusu et al., 2016] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks. arXiv:1606.04671, 2016.

[Ruvolo and Eaton, 2013] Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In ICML, pages 507–515, 2013.

[Salakhutdinov et al., 2013] Ruslan Salakhutdinov, Joshua B. Tenenbaum, and Antonio Torralba. Learning with Hierarchical-Deep Models. IEEE Trans. Pattern Anal. Mach. Intell., (8):1958–1971, 2013.

[Santoro et al., 2016] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-Learning with Memory-Augmented Neural Networks. In ICML, pages 1842–1850, 2016.

[Shannon and Weaver, 1948] C. E. Shannon and W. Weaver. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423, 1948.

[Stanley and Miikkulainen, 2002] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

[Stone et al., 2016] Peter Stone, Rodney Brooks, Erik Brynjolfsson, Ryan Calo, Oren Etzioni, Greg Hager, Julia Hirschberg, Shivaram Kalyanakrishnan, Ece Kamar, Sarit Kraus, Kevin Leyton-Brown, David Parkes, William Press, AnnaLee Saxenian, Julie Shah, Milind Tambe, and Astro Teller. "Artificial Intelligence and Life in 2030." One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel. Technical report, Stanford University, Stanford, CA, 2016.

[Tommasino et al., 2016] Paolo Tommasino, Daniele Caligiore, Marco Mirolli, and Gianluca Baldassarre. A Reinforcement Learning Architecture that Transfers Knowledge between Skills when Solving Multiple Tasks. IEEE Transactions on Cognitive and Developmental Systems, pages 1–1, 2016.

[Vapnik, 2015] Vladimir Vapnik. Learning Using Privileged Information: Similarity Control and Knowledge Transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.

[Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. In ICML, 2017.

[Wang et al., 2016] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, 2016.

[Zaremba and Sutskever, 2015] Wojciech Zaremba and Ilya Sutskever. Reinforcement Learning Neural Turing Machines. arXiv:1505.00521, pages 1–14, 2015.

[Zhou et al., 2012] Guanyu Zhou, Kihyuk Sohn, and Honglak Lee. Online Incremental Feature Learning with Denoising Autoencoders. In AISTATS, pages 1453–1461, 2012.


A Appendix

All figures in this section are simplified visualizations of ε-transducers, each for a particular task. These simplifications come in four flavors:

1. Only two instances, out of the many possible, are shown, as others would only be repetitions of an already presented pattern (black rectangles delineate instances).
2. For the sake of clarity, not all transitions (arrows) are always visualized (e.g. error or instance switch transitions).
3. All transition probabilities from a state are uniformly distributed among the visualized transitions. Only instance decision transition probabilities are shown, for illustration.
4. In Figure 4, states are merged together to avoid repetition when the environment is providing the current instance description. This way the visualization is still readable and focuses more on the important part − the part where the agent replies.

The general notation for transitions (arrows) between causal states is as follows: (o_{t+1}, r_t | a_t), where o_t is an observation/state that is the output of the environment and r_t the reward for the last performed action a_{t−1} by the agent. Note that the action a_t at the current time-step determines the reward r_t at the current time-step, but the observation/state for the subsequent step.

Figure 3: A simplified visualization of ε-transducers for two consecutive tasks A.2.1 (a) and A.2.2 (b). No error or instance switch transitions are shown. One can see the apparent sub-structures that the two tasks share, which could be exploited by a gradual learner. The state naming convention is as follows:
• The first (capital) letter denotes an instance
• The number at the end of the state descriptor represents the number of consecutive positive rewards the agent has received
• The middle letter (only in Subfigure b) defines from which group a letter was provided by the environment for the next step
For these visualizations, five consecutive positive rewards are required to complete an instance (fully shown in Figure 5).


Figure 4: A simplified visualization of the ε-transducer for task A.2.9.1 from the challenge curriculum.


Figure 5: Full visualization of the ε-transducer for task A.2.2 from the challenge curriculum. The same state naming convention applies as in Figure 3.