🚀 Next: Adaptive Agent

Now that you have learned the basics, we can look at an application: building a reinforcement learning (RL) agent using Trace primitives.

A Reinforcement Learning Agent

The essence of an RL agent is to react and adapt to different situations: it should change its behavior to become more successful at a task. Using node and @bundle, we can expose different parts of a Python program to an optimizer, making the program reactive to various feedback signals. In this sense, a self-modifying, self-evolving program behaves like an RL agent: by rewriting its own rules and logic, it can improve itself through trial and error (the reinforcement learning way!).

Building an RL agent (with program blocks) and using an optimizer to react to feedback is at the heart of policy gradient algorithms (such as PPO, which is used in RLHF, Reinforcement Learning from Human Feedback). Trace changes the underlying program blocks to improve the agent’s chance of success. Here, we look at an example of how Trace can be used to design an RL-style agent that learns to master the game of Battleship.

!pip install trace-opt
import opto.trace as trace
from opto.trace import node, bundle, model, GRAPH
from opto.optimizers import OptoPrime

Trace uses decorators like @bundle and data wrappers like node to expose different parts of a program to an LLM. Based on the user’s specification, the LLM can rewrite the entire system or only parts of it, using the feedback it receives from the environment. In this way, Trace lets users exert control over the LLM code-generation process.
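To make the idea of "exposing a function so that it can be rewritten" concrete, here is a minimal, framework-free sketch. It is not how Trace implements @bundle; the names toy_bundle, REGISTRY, and rewrite are hypothetical and exist only for this illustration. The sketch registers a function under its name and lets a caller replace its implementation from a source string, which is roughly the role an optimizer plays when it proposes new code.

```python
# Toy illustration only: NOT how Trace implements @bundle.
# `toy_bundle`, `REGISTRY`, and `rewrite` are hypothetical names.
from typing import Callable, Dict

REGISTRY: Dict[str, Callable] = {}  # function name -> current implementation


def toy_bundle(fn: Callable) -> Callable:
    """Register fn so that its implementation can be replaced later."""
    REGISTRY[fn.__name__] = fn

    def wrapper(*args, **kwargs):
        # Always dispatch to whatever implementation is currently registered.
        return REGISTRY[fn.__name__](*args, **kwargs)

    wrapper.__name__ = fn.__name__
    return wrapper


def rewrite(name: str, source: str) -> None:
    """Replace a registered function with the one defined by `source`."""
    namespace: dict = {}
    exec(source, namespace)  # `source` must define a function called `name`
    REGISTRY[name] = namespace[name]


@toy_bundle
def greet(person):
    return None  # placeholder body, like an untrained agent method


before = greet("Ada")

# In a real system, an optimizer would propose this source after seeing feedback.
rewrite("greet", "def greet(person):\n    return 'Hello, ' + person\n")
after = greet("Ada")
print(before, "->", after)
```

The call site never changes: the same greet("Ada") call returns a different result once the registered implementation has been swapped.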

The Game of Battleship

As a simple example of how Trace allows the user to design an agent, and how the agent modifies its own behavior to adapt to the environment, we can take a look at the classic game of Battleship.

[Image: Battleship]

(Image credit: DALL-E by OpenAI)

We have already implemented a simplified version of the Battleship game. The rules are straightforward: our opponent has placed 8 ships on a square board. The ships vary in size, resembling a carrier, battleship, cruiser, submarine, and destroyer. On each turn, we select a square to hit.

from examples.battleship import BattleshipBoard

board = BattleshipBoard(8, 8)

# Show the initial board with ships
board.render_html(show_ships=True)
[Output: rendered 8×8 board (columns A-H, rows 1-8) with the ships shown]

Of course, this wouldn’t be much of a game if we were allowed to see all the ships laid out on the board, since we would then know exactly where to shoot! After we choose a square, our opponent reveals whether the shot is a hit or a miss.

# Make some shots
board.check_shot(0, 0)
board.check_shot(1, 1)
board.check_shot(2, 2)

# Show the board after shots
board.render_html(show_ships=False)
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing only the shots taken so far]
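The board methods used in this tutorial (check_shot, get_shots, check_terminate) can be pictured with a small stand-in. ToyBoard below is hypothetical and is not the BattleshipBoard class from examples.battleship: ship placement is hard-coded, the (row, col) argument order is an assumption, and the map symbols follow the convention used later in the agent's docstrings (X for hits, O for misses, . for unknown).

```python
class ToyBoard:
    """Hypothetical stand-in for the tutorial's board; not the real class."""

    def __init__(self, rows, cols, ship_cells):
        self.rows, self.cols = rows, cols
        self.ships = set(ship_cells)  # hidden from the player
        self.shots = [["."] * cols for _ in range(rows)]

    def check_shot(self, row, col):
        """Mark the shot on the visible map and return True on a hit."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise ValueError(f"({row}, {col}) is off the board")
        hit = (row, col) in self.ships
        self.shots[row][col] = "X" if hit else "O"
        return hit

    def get_shots(self):
        """Return a copy of what the player is allowed to see."""
        return [row[:] for row in self.shots]

    def check_terminate(self):
        """The game ends once every ship cell has been hit."""
        return all(self.shots[r][c] == "X" for r, c in self.ships)


toy = ToyBoard(3, 3, ship_cells=[(0, 0), (0, 1)])
results = [toy.check_shot(0, 0), toy.check_shot(1, 1)]  # one hit, one miss
for row in toy.get_shots():
    print("".join(row))
```

The player-facing map never reveals the ship positions; it only accumulates the outcomes of past shots, which is the observation the agent works from.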

Define An Agent Using Trace

We can write a simple agent that plays this game. Note that we create a normal Python class, decorate it with @model, and then use @bundle to specify which parts of the class can be changed by an LLM through feedback.

Tip

Trace only has two main primitives: node and @bundle. Here, we introduce a “helper” decorator @model to expose more of the user-defined Python class to the Trace library, e.g., @model can be used for retrieving trainable parameters (declared by node and @bundle) in a Python class.

@model
class Agent:

    def __call__(self, map):
        # Run the full reason -> select pipeline and return the raw coordinate
        return self.act(map).data

    def act(self, map):
        plan = self.reason(map)
        output = self.select_coordinate(map, plan)
        return output

    @bundle(trainable=True)
    def select_coordinate(self, map, plan):
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        """
        return None

    @bundle(trainable=True)
    def reason(self, map):
        """
        Given a map, analyze the board in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        """
        return None

Just like in the previous tutorial, we need to define a feedback function that guides the agent to self-improve. We keep it simple: we just tell the agent how much reward it obtained from the game environment.

# Function to get user feedback for placing shot
def user_fb_for_placing_shot(board, coords):
    try:
        reward = board.check_shot(coords[0], coords[1])
        new_map = board.get_shots()
        terminal = board.check_terminate()
        return new_map, int(reward), terminal, f"Got {int(reward)} reward."
    except Exception as e:
        return board.get_shots(), 0, False, str(e)
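To see what the feedback function above returns in each case, the sketch below runs it against a tiny stand-in board. StubBoard is hypothetical (a 2x2 grid with one ship cell) and only mimics the three methods the function calls; the function body itself is copied from above so that the example is self-contained.

```python
class StubBoard:
    """Hypothetical 2x2 board with a single ship cell at (0, 0)."""

    def __init__(self):
        self.shots = [[".", "."], [".", "."]]

    def check_shot(self, row, col):
        hit = (row, col) == (0, 0)
        self.shots[row][col] = "X" if hit else "O"
        return hit

    def get_shots(self):
        return [r[:] for r in self.shots]

    def check_terminate(self):
        return self.shots[0][0] == "X"


def user_fb_for_placing_shot(board, coords):  # copied from the tutorial
    try:
        reward = board.check_shot(coords[0], coords[1])
        new_map = board.get_shots()
        terminal = board.check_terminate()
        return new_map, int(reward), terminal, f"Got {int(reward)} reward."
    except Exception as e:
        return board.get_shots(), 0, False, str(e)


stub = StubBoard()
bad = user_fb_for_placing_shot(stub, None)     # invalid action: coords is None
miss = user_fb_for_placing_shot(stub, [1, 1])  # valid action, no ship there
hit = user_fb_for_placing_shot(stub, [0, 0])   # valid action, sinks the only ship

for name, (_, reward, terminal, message) in [("bad", bad), ("miss", miss), ("hit", hit)]:
    print(name, reward, terminal, message)
```

Note that an invalid action and a miss both yield a reward of 0; only the feedback string distinguishes them. This is why the error message is passed along as feedback: it gives the optimizer something more informative than the number alone.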

Visualize Trace Graph of an Action

We can first take a look at what the Trace Graph looks like for this agent when it takes an observation board.get_shots() from the board (this shows the map without any ship but with past records of hits and misses). The agent takes an action based on this observation.

GRAPH.clear()

agent = Agent()
obs = node(board.get_shots(), trainable=False)
output = agent.act(obs)
output.backward(visualize=True, print_limit=20)
[Figure: Trace graph of the agent’s action, from the observation to the selected coordinate]

We can see that this is the execution graph (Trace graph) that transforms the observation (marked as list0) into the output (marked as select_coordinate0). Trace opens up the black box of how an input is transformed into an output in a system.
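The notion of an execution graph can be illustrated without any library. The sketch below is a toy model, not Trace's implementation: ToyNode, traced, and ancestors are hypothetical names. Each wrapped function returns a value that remembers which inputs produced it, so the path from an output back to the observation can be recovered afterwards.

```python
class ToyNode:
    """Hypothetical value wrapper that remembers which nodes produced it."""

    def __init__(self, data, name, parents=()):
        self.data = data
        self.name = name
        self.parents = tuple(parents)


def traced(fn):
    """Wrap fn so that its output is a ToyNode linked to its ToyNode inputs."""
    def wrapper(*args):
        values = [a.data for a in args]  # unwrap inputs before calling fn
        return ToyNode(fn(*values), fn.__name__, parents=args)
    return wrapper


@traced
def reason(board_map):
    # Count the unknown cells, as the learned `reason` later in this page does.
    return sum(row.count(".") for row in board_map)


@traced
def select_coordinate(board_map, plan):
    return [0, 0]  # fixed choice; only the graph structure matters here


def ancestors(node):
    """List node names from the output back to the inputs (depth-first)."""
    names = [node.name]
    for parent in node.parents:
        names.extend(ancestors(parent))
    return names


obs = ToyNode([[".", "."], [".", "."]], "obs")
plan = reason(obs)
out = select_coordinate(obs, plan)
path = ancestors(out)
print(path)
```

The observation appears twice in the traversal because it feeds both reason and select_coordinate, mirroring how a single input can reach the output along more than one path.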

Note

Note that not all parts of the agent are present in this graph. For example, __call__ does not appear in it. A user needs to decide what to include and what to exclude, and what’s trainable and what’s not. You can learn more about how to design an agent in the tutorials.

Define the Optimization Process

Now let’s see whether the agent can learn to play this game using the reward information from the environment.

We set up the optimization procedure:

  1. We initialize the game and obtain the initial state board.get_shots(). We wrap this in a Trace node.

  2. We enter a game loop. The agent produces an action through agent.act(obs).

  3. The action output is then executed in the environment through user_fb_for_placing_shot.

  4. Based on the feedback, the optimizer takes a step to update the agent.

import autogen
from opto.trace.utils import render_opt_step
from examples.battleship import BattleshipBoard

GRAPH.clear()

board = BattleshipBoard(8, 8)
board.render_html(show_ships=True)

agent = Agent()
obs = node(board.get_shots(), trainable=False)
optimizer = OptoPrime(agent.parameters(), config_list=autogen.config_list_from_json("OAI_CONFIG_LIST"))

feedback, terminal, cum_reward = "", False, 0

iterations = 0
while not terminal and iterations < 10:
    try:
        output = agent.act(obs)
        obs, reward, terminal, feedback = user_fb_for_placing_shot(board, output.data)
        hint = f"The current code gets {reward}. We should try to get as many hits as possible."
        optimizer.objective = f"{optimizer.default_objective} Hint: {hint}"
    except trace.ExecutionError as e:
        output = e.exception_node
        feedback, terminal, reward = output.data, False, 0

    board.render_html(show_ships=False)

    cum_reward += reward

    optimizer.zero_feedback()
    optimizer.backward(output, feedback)
    optimizer.step(verbose=False)

    render_opt_step(iterations, optimizer, no_trace_graph=True)
    iterations += 1
[Output: two rendered 8×8 boards (columns A-H, rows 1-8): the initial board with ships shown, then the board after the first shot]

Feedback: Got 1 reward.

f0

Reasoning: The given code uses two functions to evaluate a model strategy for a Battleship game, where `__code1` initially analyses the map and `__code0` subsequently selects a target coordinate based on the analysis. The goal is to improve the number of hits (rewards) by choosing the best coordinates for hits. Currently, `eval1` is outputting [0, 0], targeting the top-left corner of the grid, which results in only 1 reward. This suggests that the result from `__code1` ('eval0') is not effectively guiding coordinate selections in `__code0`. The current definition for `__code1` simply returns an empty string, which does not provide any useful analysis. Consequently, `eval0` which is used in `__code1` returns an empty string, leading to no beneficial strategic input to `__code0`. We need to redefine the `reason` function to better analyse the map and tailor `select_coordinate` to utilize such analysis effectively. The `select_coordinate` function should target positions marked with '.', indicating unknown, potentially hiding ships, rather than targeting known misses or hits.

r1

Improvement

__code1:

def reason(self, map):
    
    def count_targets(grid, target='.'):
        count = 0
        for row in grid:
            count += row.count(target)
        return count
    return count_targets(map)

__code0:

def select_coordinate(self, map, plan):
    
    def find_unknown(grid):
        for i, row in enumerate(grid):
            for j, cell in enumerate(row):
                if cell == '.':
                    return [i, j]
    return find_unknown(map)

a1
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: Got 1 reward.

f1

Reasoning: The task is to improve the number of "hits" or "hits" identified by the Battleship game logic provided in the code. There are two functions involved: __code1 and __code0. __code1 counts the unknown positions (represented as '.') on a given map. __code0, based on the result from __messages__.__ode1 (the count of unknowns), tries to find a coordinate of an unknown position on a second map and suggests it as the next coordinate to target. According to the feedback "Got 1 reward," the current system successfully hits one target but needs to increase this count. From the provided data, we can observe: 1. The maps list1 and list2, which are identical, have multiple '.' indicating unknown spots that could potentially be targets. 2. __code1 returns the total count of '.' across the board which is 63 in eval2. This is correct based on the number of '.' in a single grid multiplied by the number of grids. 3. __code0 should ideally provide a better coordinate that can lead to hitting a target rather than just identifying the first '.' found. Based on this analysis, the feedback suggests that although a '.' was indeed found and probably hit as a result of eval3's logic (returning [0, 1]), the function might need improvement to either identify a more likely target for a hit or a better strategy to maximize the hits.

r2

Improvement

__code0:

def select_coordinate(self, map, plan):
    
    def find_unknown(grid, plan):
        # Already found target, so try another tactic like selecting the last unknown in a row
        for i, row in enumerate(grid[::-1]):
            for j, cell in enumerate(row[::-1]):
                if cell == '.':
                    return [len(grid) - 1 - i, len(row) - 1 - j]
    return find_unknown(map, plan)

a2
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: Got 0 reward.

f2

Reasoning: The objective is to maximize a strategic approach in the Battleship game by improving the choice of coordinates for our play. The given feedback suggests that the result of the evaluated code which chose the coordinate [7, 7] yielded a reward of 0. This implies the chosen coordinate wasn't optimal, likely because it picked a coordinate that was not strategic based on the current state of the grid in list4, which showed an 'O' (a miss) specifically at [7, 7]. Both __code1 and __code0 functions focus on selecting and/or counting positions noted as '.', which represents unknowns, but the current strategy in __code0 of choosing the last unknown position in reverse order is resulting in selecting an already determined miss. To improve our strategy and potentially get a better reward, we can modify __code0. The modification will change the tactic to not only look for the last unknown symbol '.' but also to avoid previously marked misses 'O', leading to a potentially more successful new coordinate selection.

r3

Improvement

__code0:

def select_coordinate(self, map, plan):\n        \"\"\"\n        Given a map, select a target coordinate in a Battleship game.\n        X denotes hits, O denotes misses, and . denotes unknown positions.\n        \"\"\"\n        def find_unknown(grid, plan):\n            # Avoid already determined hits 'X' or misses 'O', select the first unknown symbol '.'\n            for i, row in enumerate(grid):\n                for j, cell in enumerate(row):\n                    if cell == '.':\n                        return [i, j]\n        return find_unknown(map, plan)

a3
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: (SyntaxError) unexpected character after line continuation character (, line 1)

f3

Reasoning: The primary issue is the syntax error in the code provided in '__code0', which prevents the code from executing. This syntax error arises from a mistake in the indentation or a missing colon at the end of the function find_unknown. The SyntaxError message suggests that there is a line continuation character problem, usually related to backslashes or newline errors. However, reviewing the provided code, we see no blatant use of such problematic characters, but the function 'find_unknown' lacks a proper indentation and separation from the rest of the 'select_coordinate' function body. To fix this, we need to properly nest the 'find_removed' function within 'select_coordinate', ensuring it's defined correctly as a nested function.

r4

Improvement

__code0:

def select_coordinate(self, map, plan):
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        """
        def find_unknown(grid, plan):
            # Avoid already determined hits 'X' or misses 'O', select the first unknown symbol '.'
            for i, row in enumerate(grid):
                for j, cell in enumerate(row):
                    if cell == '.':
                        return [i, j]
            return None
        return find_unknown(map, plan)

a4
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: Got 0 reward.

f4

Reasoning: The instruction specifies that the aim is to improve the output towards gaining more hits --- ideally targeting unknown positions ('.') which might be ships in a Battleship game. The feedback indicates that while a coordinate was suggested by the code (__code0), it resulted in '0 reward,' likely meaning no hit was made on the suggested target. Analyzing __code0 reveals that it chooses coordinates based on finding the first unknown position ('.') in the map. __code1, which is provided to __code0 as 'plan', counts the number of '.' in the map. Given that the first '.' yielded a 0 reward, there might be a strategic misalignment in selecting coordinates to maximize hits. The best approach to maximize chances of getting a hit could be altering the __code0 by changing the strategy from selecting the first unknown position to a more advanced heuristic. Possible strategies could involve picking a random unknown position instead of the first one or using a probability model if further data about ship placement are available (not in the current context). For simplicity, the suggestion is to adjust __code0 to select a random '.' in the map instead of the first one, which might distribute the 'exploration' process better across the map.

r5

Improvement

__code0:

def select_coordinate(self, map, plan):
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        This function selects a random unknown coordinate.
        """
        import random
        def find_unknown(grid, plan):
            # Collect all unknown cells
            unknowns = []
            for i, row in enumerate(grid):
                for j, cell in enumerate(row):
                    if cell == '.':
                        unknowns.append([i, j])
            # Randomly select one of the unknown cells if any are available
            return random.choice(unknowns) if unknowns else None
        return find_unknown(map, plan)


a5
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: Got 0 reward.

f5

Reasoning: The task is to improve the result of eval10 by altering the code in __code0 and __code1. __code1 counts the number of unknown cells ('.') in the map, which is used as `plan` by __code0. __code0 is intended to select an unknown cell coordinate on the map based on the value of `plan`. However, `plan` being only a count introduces a limitation as it doesn't provide specific locations or optimize the strategy of selecting the coordinate. Because of this setup, __code0 fails to use `plan` effectively because the code doesn't implement logic that uses the count to select positions. The feedback indicates a reward of 0, meaning that no optimal choices were made for selecting targets that are likely to be hits. To enhance the function, we change __code0 so it incorporates a more advanced strategy for target selection, considering the values of 'X' and 'O' around the unknown positions, which will probably improve the effectiveness of the target choosing mechanism.

r6

Improvement

__code1:

def reason(self, map):
    
    def count_targets(grid, target='.'):
        count = 0
        for row in grid:
            count += row.count(target)
        return count
    return count_targets(map)

__code0:

def select_coordinate(self, map, plan):
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        This function selects a strategically beneficial unknown coordinate, considering surrounding hits and misses.
        """
        import random
        def find_strategic_unknown(grid, plan):
            # Collect all unknown cells near hits
            strategic_unknowns = []
            directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
            for i, row in enumerate(grid):
                for j, cell in enumerate(row):
                    if cell == '.':
                        # Check surrounding cells
                        for dx, dy in directions:
                            ni, nj = i + dx, j + dy
                            if 0 <= ni < len(grid) and 0 <= nj < len(row) and grid[ni][nj] == 'X':
                                strategic_unknowns.append([i, j])
                                break
            # Fallback to any unknown if no strategic unknowns found
            return random.choice(strategic_unknowns) if strategic_unknowns else random.choice(find_unknown(grid))
        return find_strategic_unknown(map, plan)

a6
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: Got 1 reward.

f6

Reasoning: The code snippet implements a Battleship game scenario where we need to optimize the selection of targets to improve the accumulated rewards. The function '__code1' analyzes the current map and counts the unknown cells ('.') on the board. However, '__code1' does not get utilized properly in the '__code0', which ought to use 'plan' (the output of '__code1') to choose a strategically beneficial coordinate to hit next. It appears that 'plan' is being ignored in '__code0'. Improving '__code0' to make better use of '__code1' output might enhance the strategic decision-making in selecting coordinates, thus potentially increasing the rewards. The problem with the current method '__code0' seems to be the fallback method 'find_unknown(grid)', which is mentioned but never actually defined or implemented in the provided code. Since this function is supposed to be called when no 'strategic_unknowns' can be found, we need to fix this fallback to handle such situations gracefully. Furthermore, integrating 'plan' into choosing a 'strategic_unknown' can refine the target selection process effectively based on the count of unknown spots as calculated in '__code1'.

r7

Improvement

__code0:

def select_coordinate(self, map, plan):\n        \"\"\"\n        Given a map, select a target coordinate in a Battleship game.\n        X denotes hits, O denotes misses, and . denotes unknown positions.\n        This function selects a strategically beneficial unknown coordinate, considering surrounding hits and misses and the detailed plan from previous evaluations.\n        \"\"\"\n        import random\n        def find_unknown(grid):\n            return [(i, j) for i, row in enumerate(grid) for j, cell in enumerate(row) if cell == '.']\n        def find_strategic_unknown(grid, plan):\n            # Collect all unknown cells near hits whom total count matches the plan\n            strategic_unknowns = []\n            directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]\n            for i, row in enumerate(grid):\n                for j, cell in enumerate(row):\n                    if cell == '.' and len(strategic_unknowns) < plan:\n                        # Check surrounding cells\n                        for dx, dy in directions:\n                            ni, nj = i + dx, j + dy\n                            if 0 <= ni < len(grid) and 0 <= nj < len(row) and grid[ni][nj] == 'X':\n                                strategic_unknowns.append([i, j])\n                                break\n            # Fallback to any unknown if no strategic unknowns found\n            return random.choice(strategic_unknowns) if strategic_unknowns else random.choice(find_unknown(grid))\n        return find_strategic_unknown(map, plan)

a7
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: (SyntaxError) unexpected character after line continuation character (, line 1)

f7

Reasoning: The problem we're facing lies in the structure of the code provided under '#Variables'. The Feedback section tells us specifically that there is a SyntaxError indicating an 'unexpected character after line continuation character'. Syntax errors in Python occur when Python cannot understand a line or series of lines of code. Typically, this error might have been introduced by an invalid use of the continuation character (backslash, \) which isn't visible directly from the content given. Nonetheless, as no such character is visibly part of the provided code, the likely issue might be hidden or incorrectly inferred during the conversion or processing of the code string. To resolve this problem, we need to redefine the variable __code0 ensuring any unnecessary characters especially around continuation sequences, if any, are corrected or removed. I'll provide a revised method definition without internal syntax issues and confirm that the triple quoted strings (docstrings) or other parts do not have unwanted continuation characters that could raise a syntax error.

r8

Improvement

__code0:

def select_coordinate(self, map, plan):
    """
    Given a map, select a target coordinate in a Battleship game.
    X denotes hits, O denotes misses, and . denotes unknown positions.
    This function selects a strategically beneficial unknown coordinate, considering surrounding hits and misses and the detailed plan from previous evaluations.
    """
    import random
    def find_unknown(grid):
        return [(i, j) for i, row in enumerate(grid) for j, cell in enumerate(row) if cell == '.']
    def find_strategic_unknown(grid, plan):
        strategic_unknowns = []
        directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
        for i, row in enumerate(grid):
            for j, cell in enumerate(row):
                if cell == '.' and len(strategic_unknowns) < plan:
                    for dx, dy in directions:
                        ni, nj = i + dx, j + dy
                        if 0 <= ni < len(grid) and 0 <= nj < len(row) and grid[ni][nj] == 'X':
                            strategic_unknowns.append([i, j])
                            break
        return random.choice(strategic_unknowns) if strategic_unknowns else random.choice(find_unknown(grid))
    return find_strategic_unknown(map, plan)

a8
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: Got 1 reward.

f8

Reasoning: The current implementation of the function __code0 tries to find a strategically valuable coordinate based on unknown positions and surrounding hits as evaluated by the eval14. The variable eval14 gives the number of target tiles, which is 58 in the current scenario, indicating a vast number of unknowns available for consideration. The function __code0 uses the eval14 (plan variable), misunderstanding its purpose by treating it as a limit for how many strategic points to consider rather than using it to adjust the probability of finding a strategic coordinate among unknowns. This discrepancy might be leading to a less optimized selection of a strike position on the Battleship board, thus possibly producing only 1 hit (instead of multiple or more strategic ones). Since eval15 outputs [1,1], this implies either a lack of multiple strikes or ineffective selection, considering the 'got 1 reward' feedback. The function __code0 should alter its approach to use eval14 for a way to boost the probabilistic selection of really beneficial hit points.

r9

Improvement

__code0:

def select_coordinate(self, map, plan):
        
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        This function selects a strategically beneficial unknown coordinate, considering surrounding hits and misses and the detailed plan from previous evaluations.
        """
        import random
        def find_unknown(grid):
            return [(i, j) for i, row in enumerate(grid) for j, cell in enumerate(row) if cell == '.']
        def find_strategic_unknown(grid, plan):
            strategic_unknowns = []
            directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
            for i, row in enumerate(grid):
                for j, cell in enumerate(row):
                    if cell == '.' and random.random() < plan/100:
                        for dx, dy in directions:
                            ni, nj = i + dx, j + dy
                            if 0 <= ni < len(grid) and 0 <= nj < len(row) and grid[ni][nj] == 'X':
                                strategic_unknowns.append([i, j])
                                break
            return random.choice(strategic_upperbounds_minus_one) if strategic_upperbounds_minus_one else random.choice(find_unknown(grid))
        return find_strategic_unknown(map, plan)

a9
[Output: rendered 8×8 board (columns A-H, rows 1-8) showing the shots so far]

Feedback: (NameError) name 'strategic_upperbounds_minus_one' is not defined

f9

Reasoning: The feedback suggests that the variable 'strategic_upperbounds_minus_one' is not defined within the function 'select_coordinate'. This variable seems to be a crucial part of the logic to select the next move in the Battleship game, particularly when trying to find a strategic unknown location based on the plan evaluation. Upon revisiting the corresponding section of the __code0, it is evident that the variable name 'strategic_upperbounds_minus_one' is a mistake, and should likely be 'strategic_unknowns', as this list is being populated to track potential good moves based on surrounding 'X' hits. The codes need adjustment so that it properly refers to 'strategic_unknowns' when making a choice of move. This would align with the intended logic: to either choose from strategic spots if any are identified, or to fallback to any unknown spot if no strategic spots are found.

r10

Improvement

__code0:

def select_coordinate(self, map, plan):
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        This function selects a strategically beneficial unknown coordinate, considering surrounding hits and misses and the detailed plan from previous evaluations.
        """
        import random
        def find_unknown(grid):
            return [(i, j) for i, row in enumerate(grid) for j, cell in enumerate(row) if cell == '.']
        def find_strategic_unknown(grid, plan):
            strategic_unknowns = []
            directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
            for i, row in enumerate(grid):
                for j, cell in enumerate(row):
                    if cell == '.' and random.random() < plan/100:
                        for dx, dy in directions:
                            ni, nj = i + dx, j + dy
                            if 0 <= ni < len(grid) and 0 <= nj < len(row) and grid[ni][nj] == 'X':
                                strategic_unknowns.append([i, j])
                                break
            return random.choice(strategic_unknowns) if strategic_unknowns else random.choice(find_unknown(grid))
        return find_strategic_unknown(map, plan)

a10
for p in agent.parameters():
    print(p.data)
    print()
def reason(self, map):
    
    def count_targets(grid, target='.'):
        count = 0
        for row in grid:
            count += row.count(target)
        return count
    return count_targets(map)

def select_coordinate(self, map, plan):
        """
        Given a map, select a target coordinate in a Battleship game.
        X denotes hits, O denotes misses, and . denotes unknown positions.
        This function selects a strategically beneficial unknown coordinate, considering surrounding hits and misses and the detailed plan from previous evaluations.
        """
        import random
        def find_unknown(grid):
            return [(i, j) for i, row in enumerate(grid) for j, cell in enumerate(row) if cell == '.']
        def find_strategic_unknown(grid, plan):
            strategic_unknowns = []
            directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
            for i, row in enumerate(grid):
                for j, cell in enumerate(row):
                    if cell == '.' and random.random() < plan/100:
                        for dx, dy in directions:
                            ni, nj = i + dx, j + dy
                            if 0 <= ni < len(grid) and 0 <= nj < len(row) and grid[ni][nj] == 'X':
                                strategic_unknowns.append([i, j])
                                break
            return random.choice(strategic_unknowns) if strategic_unknowns else random.choice(find_unknown(grid))
        return find_strategic_unknown(map, plan)

What Did the Agent Learn?

Then we can see how this agent performs in an evaluation run.

Note

A neural-network-based RL agent would typically take orders of magnitude more than 10 iterations to learn this kind of heuristic.

If you scroll through the output, you can see how the agent plays the game step by step. The agent learns to apply heuristics such as: once a shot turns out to be a hit, target the neighboring squares to maximize the chance of hitting a ship.

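board = BattleshipBoard(8, 8)
board.render_html(show_ships=True)

# Start from the new board's observation rather than the last training state
obs = node(board.get_shots(), trainable=False)

terminal = False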
for _ in range(15):
    try:
        output = agent.act(obs)
        obs, reward, terminal, feedback = user_fb_for_placing_shot(board, output.data)
    except trace.ExecutionError as e:
        # this is essentially a retry
        output = e.exception_node
        feedback = output.data
        terminal = False
        reward = 0

    board.render_html(show_ships=False)
    if terminal:
        break
[Output: 16 rendered 8×8 boards (columns A-H, rows 1-8): the initial board with ships shown, followed by the board after each of the 15 evaluation shots]
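The neighbor-targeting heuristic described above can also be written as a short standalone function. This is a cleaned-up sketch of the idea, not the exact code the optimizer produced: hunt_target is a hypothetical name, and the sketch drops the random.random() < plan/100 gate from the learned version, so every unknown cell next to a hit is always a candidate.

```python
import random


def hunt_target(board_map, rng=random):
    """Prefer unknown cells next to a hit; otherwise pick any unknown cell.

    board_map uses 'X' for hits, 'O' for misses, and '.' for unknown cells.
    Returns a (row, col) tuple, or None when no unknown cell is left.
    """
    rows = len(board_map)
    unknown, near_hit = [], []
    for i, row in enumerate(board_map):
        for j, cell in enumerate(row):
            if cell != ".":
                continue
            unknown.append((i, j))
            neighbors = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
            if any(
                0 <= a < rows
                and 0 <= b < len(board_map[a])
                and board_map[a][b] == "X"
                for a, b in neighbors
            ):
                near_hit.append((i, j))
    candidates = near_hit or unknown  # fall back to any unknown cell
    return rng.choice(candidates) if candidates else None


board_map = [
    list("O..."),
    list(".X.."),
    list("...."),
    list("...O"),
]
choice = hunt_target(board_map)
print(choice)  # one of the four cells adjacent to the hit at (1, 1)
```

Returning None when the board has no unknown cells keeps the function total, so a caller can treat that case as the end of the game instead of catching an exception from random.choice.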