Daniel Filan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-56

May 7, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-56.pdf

Since neural networks have become dominant within the field of artificial intelligence, a sub-field of research has emerged that attempts to understand their inner workings. One standard approach within this sub-field has been to understand neural networks primarily as representing human-comprehensible features. A less-explored alternative is to understand them as multi-step computer programs. A seeming prerequisite for this is some form of modularity: different parts of the network must operate independently enough to be understood in isolation, and must implement distinct, interpretable sub-routines.

To find modular structure inside neural networks, we initially use the tools of graph clustering. A network is clusterable in this sense if its neurons can be divided into groups with strong internal connectivity but weak external connectivity. We find that trained neural networks are typically more clusterable than randomly initialized networks, and are often more clusterable than random networks with the same weight distribution as the trained network. We investigate which factors promote clusterability, and we also develop novel methods for increasing it.
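As a rough illustration of this clusterability measure (a minimal sketch, not the exact pipeline used in the thesis), the Python code below builds a neuron graph from an MLP's weight matrices, using absolute weight magnitudes as edge weights, partitions it with scikit-learn's spectral clustering, and compares its normalized cut against baselines whose weights are shuffled within each layer so as to preserve the per-layer weight distribution. The two-layer MLP, the choice of four clusters, and the helper names are illustrative assumptions; a lower normalized cut indicates a more clusterable network.

# Sketch: graph-clusterability of an MLP weight graph (illustrative; assumptions noted above).
import numpy as np
from sklearn.cluster import SpectralClustering

def adjacency_from_mlp(weights):
    """Symmetric neuron-neuron adjacency matrix built from a list of weight matrices."""
    sizes = [weights[0].shape[0]] + [w.shape[1] for w in weights]
    offsets = np.cumsum([0] + sizes)
    A = np.zeros((offsets[-1], offsets[-1]))
    for k, W in enumerate(weights):
        i0, j0 = offsets[k], offsets[k + 1]
        A[i0:i0 + W.shape[0], j0:j0 + W.shape[1]] = np.abs(W)   # edge weight = |w_ij|
        A[j0:j0 + W.shape[1], i0:i0 + W.shape[0]] = np.abs(W).T
    return A

def normalized_cut(A, labels, n_clusters):
    """n-cut: sum over clusters of (weight of edges leaving the cluster) / (cluster volume)."""
    degree = A.sum(axis=1)
    total = 0.0
    for c in range(n_clusters):
        mask = labels == c
        cut = A[mask][:, ~mask].sum()
        vol = degree[mask].sum()
        if vol > 0:
            total += cut / vol
    return total

def clusterability(weights, n_clusters=4, seed=0):
    A = adjacency_from_mlp(weights)
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=seed).fit_predict(A)
    return normalized_cut(A, labels, n_clusters)

def shuffled_baseline(weights, n_clusters=4, n_samples=20, seed=0):
    """Control: same per-layer weight distribution, but connections randomly rewired."""
    rng = np.random.default_rng(seed)
    scores = [clusterability([rng.permutation(W.ravel()).reshape(W.shape) for W in weights],
                             n_clusters) for _ in range(n_samples)]
    return float(np.mean(scores)), float(np.std(scores))

# Usage with placeholder random weights (in practice, weights come from a trained network):
W1, W2 = np.random.randn(64, 32), np.random.randn(32, 10)
print("n-cut of given network:", clusterability([W1, W2]))
print("mean/std of shuffled baselines:", shuffled_baseline([W1, W2]))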

For modularity to be valuable for understanding neural networks, it needs to have some sort of functional relevance. The type of functional relevance we target is local specialization of functionality. A neural network is locally specialized to the extent that parts of its computational graph can be abstractly represented as performing some comprehensible sub-task relevant to the overall task. We propose two proxies for local specialization: importance, which reflects how valuable sets of neurons are to network performance; and coherence, which reflects how consistently their neurons associate with features of the inputs. We then operationalize these proxies by taking techniques conventionally used to interpret individual neurons and applying them instead to groups of neurons produced by graph clustering algorithms. Our results show that clustering succeeds at finding groups of neurons that are important and coherent, although not every group it finds has both properties.
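As a minimal sketch of how such proxies can be operationalized (simplified stand-ins, not the exact measures used in the thesis), the code below assumes two hypothetical helpers: forward(X, mask), which runs the network with hidden neurons zeroed wherever the mask is zero, and hidden(X), which returns hidden-layer activations. Importance is then the accuracy drop from lesioning a cluster, and coherence is the fraction of the cluster's mean activation that comes from its most-preferred input class.

# Sketch: importance and coherence proxies for a cluster of hidden neurons
# (illustrative operationalizations; `forward` and `hidden` are hypothetical helpers).
import numpy as np

def importance(forward, X, y, cluster, n_hidden):
    """Accuracy drop when the cluster's neurons are lesioned (zeroed out)."""
    full_mask = np.ones(n_hidden)
    lesion_mask = full_mask.copy()
    lesion_mask[cluster] = 0.0
    acc_full = np.mean(np.argmax(forward(X, full_mask), axis=1) == y)
    acc_lesioned = np.mean(np.argmax(forward(X, lesion_mask), axis=1) == y)
    return acc_full - acc_lesioned

def coherence(hidden, X, y, cluster, n_classes):
    """Fraction of the cluster's mean absolute activation contributed by its
    most-preferred class (assumes every class appears in X)."""
    acts = np.abs(hidden(X))[:, cluster]          # shape: (n_examples, cluster_size)
    per_class = np.array([acts[y == c].mean() for c in range(n_classes)])
    return per_class.max() / (per_class.sum() + 1e-12)

A cluster scoring high on both proxies is a candidate for the kind of local specialization described above.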

We conclude with a case study in which we apply more standard interpretability tools, designed to reveal the features represented by directions in activation space, to neural networks trained on the reward function of the game CoinRun. Despite our networks achieving low test loss, these tools show that the networks do not adequately represent relevant features and badly mispredict reward out of distribution. That said, the tools do not reveal a clear picture of what computation the networks are in fact performing. This not only illustrates the need for better interpretability tools to understand generalization behaviour, but also motivates it: if we take these networks as models of the 'motivation systems' of policies trained by reinforcement learning, the conclusion is that such networks may competently pursue the wrong objectives when deployed in richer environments.
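The out-of-distribution misprediction reported here is separate from the interpretability analysis itself, but its quantitative signature is easy to check. The sketch below assumes a hypothetical reward_model callable and paired (observation, true-reward) arrays for in-distribution and out-of-distribution levels; a large gap between the two errors is the failure pattern described above.

# Sketch: comparing reward-prediction error in and out of distribution
# (illustrative harness; `reward_model` and the datasets are hypothetical).
import numpy as np

def reward_mse(reward_model, observations, true_rewards):
    """Mean squared error of predicted vs. true reward on one dataset."""
    preds = reward_model(observations)
    return float(np.mean((preds - true_rewards) ** 2))

def ood_gap(reward_model, in_dist, out_of_dist):
    """Compare reward-prediction error in vs. out of distribution."""
    err_id = reward_mse(reward_model, *in_dist)       # in_dist = (observations, rewards)
    err_ood = reward_mse(reward_model, *out_of_dist)
    return {"in_dist_mse": err_id, "ood_mse": err_ood,
            "ratio": err_ood / (err_id + 1e-12)}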

Advisor: Stuart J. Russell


BibTeX citation:

@phdthesis{Filan:EECS-2024-56,
    Author= {Filan, Daniel},
    Title= {Structure and Representation in Neural Networks},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-56.html},
    Number= {UCB/EECS-2024-56},
}

EndNote citation:

%0 Thesis
%A Filan, Daniel 
%T Structure and Representation in Neural Networks
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 7
%@ UCB/EECS-2024-56
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-56.html
%F Filan:EECS-2024-56