In the simplest implementations of reinforcement learning, state and action spaces are represented as tables, whose entries are the value estimates of particular states or actions. For these simple implementations, whether the action or state space is of a fixed size isn’t a concern: if one state has a different action set from another, it simply has more or fewer entries in the look-up table. The trade-off is that table-lookup-based implementations can only be applied to problems in which the state and action spaces are small.
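As a toy illustration, such a look-up table can be as simple as a dictionary keyed on (state, action) pairs; the state and action labels below are purely hypothetical:

```python
from collections import defaultdict

# A minimal sketch of a tabular action-value estimate: one entry per
# (state, action) pair. States with more legal actions simply contribute
# more entries to the table.
q_table = defaultdict(float)

def td_update(state, action, reward, next_state, next_actions, alpha=0.1, gamma=0.99):
    """One Q-learning update on the look-up table."""
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])

# Two states with differently sized action sets coexist without issue.
td_update("s0", "left", 1.0, "s1", ["up", "down", "stay"])
td_update("s1", "up", 0.0, "s2", ["stay"])
```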
To effectively apply reinforcement learning to larger problems, we use function approximation. In function approximation, the values of states and actions are given as the outputs of parameterised functions which take states and/or actions as inputs. Function approximation also enables us to represent the policy as a function that outputs action probabilities for an input state.
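For comparison with the table above, here is a minimal sketch of a parameterised value function; the state dimension and layer sizes are arbitrary choices made for illustration:

```python
import torch
import torch.nn as nn

# A state's value is now the output of a parameterised function (a small MLP)
# rather than a table entry. STATE_DIM and the layer widths are arbitrary.
STATE_DIM = 8

value_fn = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, 1),  # scalar value estimate for the input state
)

state = torch.randn(STATE_DIM)    # a hypothetical state feature vector
value_estimate = value_fn(state)  # v(s; theta)
```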
The most popular class of functions used for function approximation is neural networks. Most neural network architectures expect a fixed-size input and output. Even architectures that are applied to variable-length sequences are typically designed around a maximum sequence length that needs to be determined beforehand.
This complicates things if we wish to use function approximation in combination with a non-fixed action space, especially in problems for which the maximum size of the inputs/outputs is very large or hard to determine. Problems with an unbounded state space are also difficult to reconcile with function approximation, and in many cases a non-fixed action space is linked to an unbounded state space.
Teaching an agent to play chess is a great example of a problem without a fixed-size action space. At the start of the game, the agent can only legally move its pawns and knights. This presents the agent with a different action space from one it would see later in the game, when a greater variety of pieces can move.
Fortunately, researchers have found a workaround that enables us to apply RL to this problem with a fixed action space. In DeepMind’s AlphaZero agent [1], the action set remains fixed even though the number of legal actions varies between states. This is possible because the number of possible (not necessarily legal) moves in chess is reasonably small regardless of the current state of the board: there are 4,672 of them (see page 18 of [1] for where this number comes from).
This allows a fixed action set of size 4,672, where illegal actions (e.g. trying to move a piece that doesn’t exist, or moving a piece that is blocked) are handled by setting their selection probabilities to zero and renormalizing the remaining probabilities. This technique is known as masking, and it is also used in neural network architectures applied to variable-length sequences: a maximum sequence length is determined when building the architecture, and shorter sequences are filled with a padding token after their final entry, which tells the network to mask those positions when processing the rest of the sequence.
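In code, masking is commonly implemented by pushing the logits of illegal actions to negative infinity before the softmax, which is equivalent to zeroing their probabilities and renormalizing. A rough sketch, with a made-up legal-move mask for illustration:

```python
import torch

def masked_policy(logits: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """Mask illegal actions by setting their logits to -inf before the softmax.

    logits:     shape (4672,), raw scores over the fixed chess move encoding
    legal_mask: shape (4672,), boolean, True where the move is legal
    """
    # A logit of -inf receives probability zero after the softmax, which is
    # equivalent to zeroing and renormalizing the probabilities.
    masked_logits = logits.masked_fill(~legal_mask, float("-inf"))
    return torch.softmax(masked_logits, dim=-1)

# Illustrative usage with the AlphaZero-sized action set.
logits = torch.randn(4672)
legal_mask = torch.zeros(4672, dtype=torch.bool)
legal_mask[:20] = True                     # pretend the first 20 moves are legal
probs = masked_policy(logits, legal_mask)  # sums to 1, zero on illegal moves
```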
The technique of masking falls short for action spaces whose maximum possible size cannot be determined.
This is pertinent to the problem of using RL for pair selection in Buchberger’s algorithm [2]. In this problem, the action space directly corresponds to the state space: the state is essentially a list of the pairs we are selecting between, along with their features. The agent performs pair selection, so the action space consists of the indices of these same pairs. Choosing a pair removes that pair, and hence its index, from the action space.
The size of the state, and hence of the action space at the next step, depends on how the chosen action affects the environment, which in this case is the remaining steps of Buchberger’s algorithm until the next pair selection. Those remaining steps can lead to anything from no new pairs being added to the state to many new pairs being added.
It is difficult to determine how the action space will change without actually computing the steps of Buchberger’s algorithm for that action in that given state. Given the massive number of unique states, and hence actions, that can occur in Buchberger’s algorithm, the state space is essentially unbounded, and working out an upper limit on the number of actions is impossible.
So, what can we do that will enable us to continue using function approximation with non-fixed action spaces without an upper bound on the number of actions? There are a few ideas that I believe are worthy of further exploration.
The first is to impose an arbitrary bound on the size of the pair set. One way of doing this would be to have a heuristic that selects the N “most promising” actions (where N is a tuneable parameter), which would then form the action space for that state, as in the sketch below.
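Here the `heuristic_score` function and the pair representation are placeholders, since the real heuristic would be problem-specific:

```python
# A sketch of bounding the action space: a heuristic scores every candidate
# pair and only the N "most promising" pairs are exposed to the agent.

def heuristic_score(pair) -> float:
    # e.g. prefer low-degree pairs; the real heuristic is problem-specific.
    return -pair["degree"]

def bounded_action_space(pairs, n: int):
    """Return at most n pairs, ranked by the heuristic, as the agent's action set."""
    return sorted(pairs, key=heuristic_score, reverse=True)[:n]

pairs = [{"id": i, "degree": d} for i, d in enumerate([3, 7, 2, 5, 4])]
actions = bounded_action_space(pairs, n=3)  # fixed upper bound N = 3
```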
Another method is to evaluate each of the actions one by one and then select from the action space using these individual evaluations. This may only be an option for problems in which the action space directly corresponds to the state space, where each “element” in the state space, together with its associated features, maps to a single action. Provided the feature vector for each element is the same size, we can create a network architecture that accepts an individual element as an input and returns a score for choosing the associated action. Applying a softmax function to these scores then gives a probability distribution over the action space. This is the approach used in [2].
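A minimal sketch of this idea might look as follows; the feature dimension and layer sizes are arbitrary, and this is not the exact architecture used in [2]:

```python
import torch
import torch.nn as nn

# A shared network maps the feature vector of a single element (here, a pair)
# to a scalar score, and a softmax over however many scores there are yields
# the policy. FEATURE_DIM and the layer widths are arbitrary.
FEATURE_DIM = 6

score_net = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def policy_over_pairs(pair_features: torch.Tensor) -> torch.Tensor:
    """pair_features: shape (num_pairs, FEATURE_DIM); num_pairs can vary per state."""
    scores = score_net(pair_features).squeeze(-1)  # shape (num_pairs,)
    return torch.softmax(scores, dim=-1)           # distribution over current pairs

probs_small = policy_over_pairs(torch.randn(4, FEATURE_DIM))   # 4 candidate pairs
probs_large = policy_over_pairs(torch.randn(57, FEATURE_DIM))  # 57 candidate pairs
```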
This approach is similar to how pointer networks [3] deal with sequence inputs of different sizes. Pointer networks are a type of neural network architecture created to tackle problems in which the size of the output depends on the size of the input. In the original paper, the network is applied to example problems in which both the input and output are sequences. It should be possible to modify the architecture so that, instead of outputting a probability distribution over the inputs for each index in a sequence, it outputs a single distribution over the inputs at each timestep of an MDP. The possible advantage of a pointer network over the previous approach is that it incorporates attention, which makes the output probability for a given action more dependent on the other available actions.
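To illustrate, here is a simplified sketch of the pointing mechanism only, where a query summarising the current state attends over the encoded candidate actions; this is not the full encoder-decoder architecture of [3], and all dimensions are arbitrary:

```python
import torch
import torch.nn as nn

# Each candidate action is encoded, a query summarising the whole state attends
# over the encodings, and the attention weights themselves are the action
# probabilities. Because the query is built from all candidates, each action's
# probability depends on the other available actions.
FEATURE_DIM, HIDDEN = 6, 32

encoder = nn.Linear(FEATURE_DIM, HIDDEN)
query_net = nn.Linear(HIDDEN, HIDDEN)
v = nn.Linear(HIDDEN, 1, bias=False)  # scoring vector, as in additive attention

def pointer_policy(pair_features: torch.Tensor) -> torch.Tensor:
    """pair_features: shape (num_pairs, FEATURE_DIM); returns a distribution over pairs."""
    enc = encoder(pair_features)                     # (num_pairs, HIDDEN)
    query = query_net(enc.mean(dim=0))               # state summary as the query
    scores = v(torch.tanh(enc + query)).squeeze(-1)  # additive attention scores
    return torch.softmax(scores, dim=-1)

probs = pointer_policy(torch.randn(9, FEATURE_DIM))  # works for any number of pairs
```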
A final idea is to use domain knowledge of the problem to decompose the input and output of the function approximator into tensors that remain fixed in size for every possible input state. This would require identifying features of states, and corresponding actions based on these features, that do not vary. In Buchberger’s algorithm, for example, the features could include the degree of each pair, the number of variables, and so on. The set of possible actions could then be something like: choose the pair with the largest degree, or the pair with the fewest variables. This space of actions would not vary from state to state.
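A sketch of what this might look like for pair selection, with entirely illustrative features and selection rules:

```python
# Instead of pointing at a specific pair, the agent chooses a fixed selection
# rule, and a fixed-length summary of the state is fed to the approximator.
# The features and rules below are placeholders, not a tested design.

FIXED_ACTIONS = {
    0: lambda pairs: max(pairs, key=lambda p: p["degree"]),    # largest degree
    1: lambda pairs: min(pairs, key=lambda p: p["degree"]),    # smallest degree
    2: lambda pairs: min(pairs, key=lambda p: p["num_vars"]),  # fewest variables
}

def fixed_state_features(pairs):
    """Fixed-length summary of a variable-sized pair set."""
    degrees = [p["degree"] for p in pairs]
    return [len(pairs), min(degrees), max(degrees), sum(degrees) / len(degrees)]

pairs = [{"degree": 3, "num_vars": 2}, {"degree": 5, "num_vars": 4}]
features = fixed_state_features(pairs)  # always length 4
chosen_pair = FIXED_ACTIONS[0](pairs)   # action 0: pick the largest-degree pair
```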
A possible issue with this final solution is that it stifles the ability of the RL agent to find features of its own and could lead to worse performance. It also requires a deeper understanding of the problem domain so that useful features can be identified.
To conclude, I believe that there is room for productive research into how to effectively apply reinforcement learning to problems in which the action space isn’t fixed. This post highlights examples in which knowledge of the problem domain enables clever workarounds and also examples in which finding a solution will require more work. In the future, we should expect that more problems of this nature will arise as reinforcement learning branches out into more complicated domains.
[1] D. Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, Science, 2018. https://www.science.org/doi/epdf/10.1126/science.aar6404
[2] D. Peifer, M. Stillman, D. Halpern-Leistner, “Learning selection strategies in Buchberger’s algorithm”, ICML 2020. http://proceedings.mlr.press/v119/peifer20a.html
[3] O. Vinyals, M. Fortunato, N. Jaitly, “Pointer Networks”, 2015. https://arxiv.org/abs/1506.03134