Comparing Word Recognition by Humans and Deep Neural Networks

We compare word recognition by deep neural networks (DNNs) and humans, asking whether increased pooling in the network can model crowding in human vision. Our experiments focus on a "Convolutional Recurrent Neural Network" (CRNN) [Shi et al., 2016], a popular model for word recognition, and measure the network's efficiency and crowding in word recognition. To measure efficiency, we assess the network's performance in recognizing random 4-letter words in a monospaced font at various contrast levels on a white-noise background. The network's efficiency is lower than the human observer's: in our experiments it attains roughly one tenth of the 3% efficiency that humans attain [Pelli et al., 2003]. Letter crowding in human vision results in a minimum threshold spacing, independent of letter size, and is usually explained as inappropriately large pooling for the task at hand. We therefore studied how the network's size and spacing thresholds are affected by varying its pooling from 2 to 32, retraining each modified network as specified by the original authors and measuring word recognition accuracy as a function of letter size and spacing. For humans tested at any given eccentricity, there are two regimes, one limited by crowding and one limited by acuity [Song et al., 2014]. In the crowding regime, threshold size is inversely related to the spacing ratio; in the acuity regime, threshold size is independent of the spacing ratio. In the network, our manipulation revealed only one regime across all pooling values: a log-log plot of acuity versus spacing ratio has a slope of -0.3, unlike the human data, whose slopes are -1 (crowding limited) and 0 (acuity limited). Based on these results, we believe there are important limitations in how well this network models human reading.
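
The abstract does not spell out how efficiency is computed. Assuming the standard threshold-energy definition from the ideal-observer literature (as in Pelli and colleagues), efficiency is the ratio of the contrast energy the ideal observer needs at threshold to the contrast energy the observer under test needs:

    \eta = E_{\mathrm{ideal}} / E_{\mathrm{threshold}}

Under that definition, the quoted figures correspond to roughly \eta \approx 0.03 for humans and \eta \approx 0.003 for the network.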
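
As a rough illustration of the pooling manipulation (the layer layout and parameter names below are assumptions for a PyTorch-style sketch, not the CRNN code of Shi et al., 2016), the convolutional front end might expose the pooling size as follows:

    import torch.nn as nn

    def make_backbone(pool_size: int = 2) -> nn.Sequential:
        """Toy CRNN-style convolutional front end with adjustable max pooling.

        pool_size is the quantity varied from 2 to 32 in the experiment above;
        this is a simplified sketch, not the architecture of Shi et al. (2016).
        """
        return nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=pool_size, stride=pool_size),  # pooling under study
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    # A network would be built and retrained for each pool_size in {2, 4, 8, 16, 32}
    # before measuring its size and spacing thresholds.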
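
The reported slopes come from fitting threshold letter size against spacing ratio on log-log axes. A minimal sketch of that fit (the numbers are illustrative placeholders, not the measured data) could look like this:

    import numpy as np

    # Hypothetical threshold letter sizes (deg) at several spacing ratios.
    spacing_ratio = np.array([1.1, 1.4, 2.0, 2.8, 4.0])
    threshold_size = np.array([0.52, 0.48, 0.44, 0.40, 0.35])

    # Slope of log threshold size vs. log spacing ratio.
    slope, intercept = np.polyfit(np.log10(spacing_ratio), np.log10(threshold_size), 1)
    print(f"log-log slope: {slope:.2f}")

    # A slope near -1 would indicate the crowding-limited regime and near 0 the
    # acuity-limited regime; the network instead shows about -0.3 at every pooling size.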