A framework for computational modelling of interaural time difference discrimination of normal and hearing-impaired listeners

Different computational models have been developed to study the interaural time difference (ITD) perception. However, only few have used a physiologically inspired architecture to study ITD discrimination. Furthermore, they do not include aspects of hearing impairment. In this work, a framework was developed to predict ITD thresholds in listeners with normal and impaired hearing. It combines the physiologically inspired model of the auditory periphery proposed by Zilany, Bruce, Nelson, and Carney [(2009). J. Acoust. Soc. Am. 126(5), 2390-2412] as a front end with a coincidence detection stage and a neurometric decision device as a back end. It was validated by comparing its predictions against behavioral data for narrowband stimuli from literature. The framework is able to model ITD discrimination of normal-hearing and hearing-impaired listeners at a group level. Additionally, it was used to explore the effect of different proportions of outer- and inner-hair cell impairment on ITD discrimination.

A framework for computational modelling of interaural time difference discrimination of normal and hearing-impaired listeners

I. INTRODUCTION
The best known theory of how we process interaural time differences (ITDs) was proposed by Jeffress (1948). In short, it suggests that ITDs are decoded by neurons in the central auditory system, which receive inputs from both ears and are sensitive to specific delays of their inputs. Thus, they function as coincidence detectors and fire at a particular ITD.
Different computational frameworks based on the Jeffress model have been developed to understand, study, and simulate ITD perception. Initially, binaural phenomena were predicted using cross-correlation of the (continuous) signals arriving to the left and right ears (Sayers and Cherry, 1957). Later on, models started incorporating the correlation of the neural response of both ears at the level of the auditory nerve (AN) as a main component. For instance, Colburn and Latimer (1978) and Colburn (1973Colburn ( , 1977) used a coarse model of the AN (consisting of a bandpass filter, automatic gain control, and an exponential rectifier) together with a binaural display to predict ITD discrimination in tone bursts across different levels, as well as binaural unmasking. Stern and Colburn (1978), Stern and Trahiotis (1992), and Stern and Shear (1996) extended this model by including a centrality weighting function (designed to model the larger proportion of coincidence detection units sensitive to shorter ITDs) and a straightness weighting function (designed to emphasize the internal delays consistent across different frequencies). Bernstein and Trahiotis (2012), Lindemann (1986), and Stern and Colburn (1978) also combined ITD with interaural level differences (ILD) information to predict perceptual lateralization as well as binaural detection.
These frameworks have contributed greatly in the studying and understanding of the processing of ITDs by the auditory system. However, their approach for modelling the AN can be considered elemental, since they include a limited amount of aspects regarding its biology. Incorporating the anatomy and the physiological processes underlying the auditory system can provide a better insight into how it processes binaural sounds. More recently, different models have been developed that explicitly include some biological aspects of binaural auditory temporal processing. For instance, Gai et al. (2014) combined a model of the AN with a model of bushy cells to study ITD coding in medial superior olive (MSO) neurons. Takanen et al. (2014) developed a method for visualizing binaural interactions (including sound lateralization using ITDs) using models of the superior olivary complex. Wang et al. (2014) presented a model of the binaural pathway capable of simulating ITD sensitivity of an inferior colliculus (IC) neuron to sinusoidally amplitude modulated tones and broadband noise. However, there are few physiologically inspired models that have been used to investigate ITD discrimination (e.g., Brughera et al., 2013;Hancock and Delgutte, 2004;Prokopiou et al., 2017).
Studying HI using a physiologically based computational approach (Durlach et al., 1981;Moore, 1996) would allow us to systematically investigate the mechanisms that are detrimental to the (binaural) hearing system and the effect of specific aspects of HI in listeners' performance in different tasks without the confounds of behavioral experiments. Moreover, it could provide us with additional means to improve the design, implementation, and fitting of hearing devices (e.g., hearing aids, cochlear implants). Furthermore, there has been no attempt to model ITD discrimination in HI listeners using a detailed physiologically based approach.
In this work, we developed a computational framework that addresses these issues. Namely, we used our framework to study ITD discrimination by predicting its thresholds (i.e., just noticeable difference) in both NH and HI listeners. It combines the physiologically inspired front end proposed by Zilany et al. (2014Zilany et al. ( , 2009) with a coincidence detection stage and a neurometric-based decision device as a back end. We validated it by comparing its ITD threshold predictions for different stimuli against behavioral data reported in literature. Additionally, we used it to explore the effect of different proportions of outer-and inner-hair cell (OHC, IHC) impairment on ITD discrimination.

II. FRAMEWORK DESCRIPTION
The proposed framework is composed of three main building blocks: (1) auditory periphery (responsible for generating spike trains elicited by a given stimulus with an imposed ITD), (2) coincidence detection (its purpose is to compute the imposed ITD), and (3) decision device (its objective is to predict an ITD threshold based on the distributions of the computed ITDs). The framework's block diagram is shown in Fig. 1 and is explained in detail below.

A. Auditory periphery
As a front end, we chose the model proposed by Zilany et al. (2014Zilany et al. ( , 2009. It is capable of reproducing response properties of AN fibers to acoustic stimuli and has been validated with physiological data over a wider range than previously existing models. Furthermore, it has been successfully used to study a variety of auditory phenomena, such as masking release (Bruce et al., 2013), frequency selectivity (Jennings and Strickland, 2012), neural adaptation to sound level (Zilany and Carney, 2010), sensory responses to musical consonance-dissonance (Bidelman and Heinz, 2011), speech intelligibility in noise , neural coding of chimaeric speech (Heinz and Swaminathan, 2009), and overshoot adaptation (Jennings et al., 2011).
It is composed of different modules, each simulating a particular element of the auditory periphery. For each FIG. 1. (Color online) Block diagram of the proposed framework. Signals coming from the left and right ear (with an imposed ITD) are processed by the model of the auditory periphery proposed by Zilany et al. (2014) and Zilany et al. (2009). The resulting spike trains are fed to a coincidence detection stage, which computes the imposed ITD. This process is repeated numerous times in order to generate a distribution of the computations for different ITD values. These distributions are used as an input to the decision device, yielding as an output a predicted threshold value in ls. channel (left and right), the model received as an input an instantaneous pressure waveform with a sampling frequency of 100 kHz. In all cases, the delay was applied to the right waveform, shifting it in time with respect to the left waveform. For each ear, the stimulus was first passed through a filter emulating the middle ear. Then, the output was passed through a signal path and a control path. The signal path simulated the behavior of the OHC-controlled filtering properties of the basilar membrane in the cochlea and the transduction properties of the IHCs by a succession of nonlinear and low-pass filters. The control path simulated the function of the OHCs in controlling basilar membrane filtering. It did so with a wideband basilar membrane filter followed by a non-linearity module and an OHC low-pass filter. The control path output fed back into itself and into the signal path, as well. The IHCs output then went through an IHC-AN synapse module with two power-law adaptation paths (which account for slow and fast adaptations) and a spike generator (which accounts for the activity, adaptation, and refractoriness of the AN response).
The model allows to be tuned with different parameters, which were defined as follows. We chose to simulate human AN fibers with the same characteristic frequency (CF) as the stimulus frequency (in the case of pure tones) or center frequency (in the case of bandpass noise). These were tuned using values reported by Shera et al. (2002). For the sake of simplicity, we decided to simulate only high spontaneous rate (100 spikes/s) fibers, since in our previous work (Prokopiou et al., 2017) we found that the use of a physiologically relevant mixture of high-, middle-, and lowspontaneous rate fibers had no significant effects on temporal coding at the simulated levels (Table I). We opted for an implementation of the power-law synapse function given it yields a more physiologically accurate response to relatively long stimuli (although at the expense of higher computational power, Zilany et al., 2009). Last, we chose to use a pre-determined fixed seed for the fractional Gaussian noise to better separate the (independent) effects of internal (i.e., physiological) and external (i.e., stimulus-driven) noise (Zilany et al., 2014).
Another important parameter was the number of spike trains that needed to be simulated in order to obtain a physiologically relevant response for each ITD computation. We defined this number based on the work by Louage et al. (2006). They recorded spike trains from AN and anteroventral cochlear nucleus neurons of cats to investigate the temporal coding of sound to noise stimuli in afferent neurons and reported collecting $3000 spikes per token. Therefore, we made sure that each simulated response had at least 3000 spikes for each channel (left and right). Considering the average number of generated spikes per second to each particular stimulus and the stimulus duration (Sec. III A), the number of required spike trains ranged from 35 up to 85 (per channel). Thus, we used information of 70-170 neurons for each ITD prediction (2 channels, left and right). The generation of spike trains is a computationally expensive task, thus we wanted to make the most out of the simulated data by creating new spike train combinations across the simulated neurons. Most importantly, we were interested in obtaining distributions of the ITD predictions (Sec. II C) without making any a priori assumptions about them (Gr€ un, 2009;Ventura, 2010). Bootstrapping is a suitable technique that helped us achieve both. It consists of generating a data subset by randomly sampling a larger data set with replacement (Witten and Frank, 2011). In our case, the data subset consisted of the required number of spike trains for each condition (35-85 spike trains per channel, Sec. II A). We defined this number arbitrarily as 20% of the larger data set. Thus, we generated 175-425 spike trains (per channel) in total for each condition.

Shuffled cross correlogram
To quantify the temporal pattern of the AN activity, we used the shuffled cross correlogram (SCC) Louage et al., 2004). The SCC is a modified version of the shuffled auto correlogram (SAC) (Joris, 2003). It can be thought as a metric of binaural coincidence detection which compares the timing of the neural firing between the left and right AN fibers. In other words, the SCC quantifies coincident spikes across two different spike trains. The construction of the SCC is illustrated in Fig. 2 and is explained as follows.
First, N spike trains from the left ear and N spike trains from the right ear were used as inputs. Then, the forward and backward time intervals between all the spikes of the first left spike train and all the spikes of all the right spike trains were measured. The same was done with all the spikes of the second left spike train and all the spikes of all the right spike trains and so on. These time intervals were tallied into a histogram with binwidth Ds. The latter can be thought as an integration window: if two spikes occur within this time interval, they are considered to be coincident. Louage et al. (2004) originally suggested using a value of 50 ls. However, the resolution of the SCC curve-and of the model's ITD predictions in consequence-is limited by it. Since the objective of the model is to predict ITD thresholds that are behaviorally relevant to human perception, 50 ls was too long. We explored using different Ds values and found a trade-off between SCC resolution and curve smoothness: smaller Ds values meant better resolution but more jagged SCC curves (and vice versa). Therefore, we decided on a value of 20 ls.
We were interested in capturing the firing temporal properties of the AN fibers. Thus, the SCC was normalized by the number of spike trains N, the average firing rate r, Ds, and the stimulus duration D. This was done by dividing the SCC by N 2 r 2 Ds D, making it independent from these parameters and thus dimensionless. In a normalized SCC curve, a count value larger than 1 means that spikes across spike trains tend to be temporally correlated; a count value of 1 shows that there is no (stimulus-induced) temporal correlation; and a count value smaller than 1 indicates anticorrelation . Furthermore, we focused our analysis on a window centered at 0 ms spanning 62 ms (thus 4 ms in total). This is very short compared to the duration of the stimuli used (Sec. III A). Thus, it was not necessary to correct the SCC curve for the distortion caused by the finite duration of the stimulus. An example SCC curve is shown in Fig. 3(A).

Centrality weighting
Different physiological studies of the mammalian auditory system have found that there is a relatively large proportion of neural coincidence detectors that are more sensitive to interaural delays of smaller magnitude (Kuwada et al., 1997;Kuwada et al., 1987;Kuwada and Yin, 1983;Yin et al., 1986). We accounted for this by introducing a centrality weighting function. The latter served two purposes: to reduce the probability of choosing SCC features for the ITD computation that might be ambiguous (Sec. II B 3) and to emphasize the range relevant for human perception.
Several centrality weighting functions have been developed based on psychoacoustic data (e.g., Colburn, 1977;Shackleton et al., 1992;Stern and Shear, 1996). Based on our previous experience, we chose the function proposed by Stern and Shear (1996). This function p depends on the delay s and is weakly dependent on the stimulus (center) frequency f, as shown in Eq. (1),

Peak selection
We were interested in computing the imposed ITD. We decided to estimate it as the delay value of the maximum peak 1 of the weighted SCC curve, (i.e., the delay value with the largest temporal correlation), as shown in Fig. 3(C).

C. Decision device
For each case, the coincidence detection stage was run 100 times, each time receiving a different set of bootstrapped spike trains (Sec. II A) and yielding as a result a computed ITD value. Then, we generated distributions of the model computations for different ITDs. Said distributions had a time resolution given by the chosen bin size (20 ls, Sec. II B 1). In order to make them continuous and smoother, we fitted a cubic spline using the original data points as anchors. The smoothened distributions were scaled to make sure that the area under the curve added up to the number of ITD computed values (100). Example distributions are shown in Fig. 4.
Analogously to a behavioral alternative-forced-choice paradigm, we defined the case where ITD ¼ 0 as the reference condition and the case where ITD 6 ¼ 0 as the experimental condition. For each experiment, we computed the detection index (d 0 ) metric (Green and Swets, 1966), given by Eq. (2), Then, we built a neurometric function by computing d 0 across different ITD values (namely, 10, 20, 40, 80, 160, and 320 ls, Fig. 5). Afterwards, we fitted the sigmoid function given by Eq. (3) across these points, where a is the bottom limit of d 0 (which we constrained to be not smaller than 0, which translates to a 50% chance of correctly discriminating between the reference and the experimental distribution), b is the top limit of d 0 (which we constrained to be not larger than 4.65, which translates to a perfect discrimination between the reference and the experimental distribution, Macmillan and Creelman, 2004), c is the ITD exp value that yields a d 0 value corresponding to 50% of the range of the fitted curve, and d is the slope. The values for these parameters were obtained using a nonlinear regression with iterative least squares estimation. Finally, the ITD threshold was computed as the ITD value corresponding to a point of the fitted sigmoid with a specific d 0 value. The latter was matched to that of the method used in the behavioral experiment that we wanted to model (Table I). It is worth noting that the predicted ITD threshold is output in ITD units (i.e., ls).

A. Framework validation
We validated the framework's performance in predicting ITD thresholds by comparing its predictions against behavioral data from different studies reported in literature. We were FIG. 3. (Color online) Different stages of the coincidence detection block for a pure tone with f ¼ 1 kHz and an imposed ITD of 160 ls. (A) The raw SCC curve. (B) The centrality weighting function p (adapted from Stern and Shear, 1996). (C) The weighted SCC curve. The ITD was obtained from the latter as the delay value of the maximum peak (dotted line). interested in studies that included NH (Sec. III A 1) and HI (Sec. III A 2) participants and that used narrow-band stimuli (e.g., pure tones, bandpass noise) with ITDs being the only binaural cue present. Table I shows the participants' hearing status, number, and age from the chosen data sets, as well as the characteristics of the stimuli used. In each case, we fed stimuli with these exact same characteristics to the framework. Furthermore, we also generated ITD threshold predictions for additional cases, complementing the behavioral data with computational modelling results. These are shown in Table I in italics. Unless stated otherwise, we show the geometric mean and the geometric standard deviation (Kirkwood, 1979) as error bars. In all cases, we show the ITD threshold values on a logarithmic scale (Saberi, 1995). Figure 6 shows the model's predictions for pure tones in NH listeners. It includes data reported by Brughera et al. (2013). For the latter, we show the median of the ITD threshold for each frequency (together with the original data). When the participants were not able to perform the task, we considered an ITD threshold of þ1 ls. We can see that the predicted ITD threshold values follow the trend of the behavioral data. Figure 7 shows the model's predictions for bandpass noise in NH listeners. It includes data reported by Gabriel et al. (1992), Hawkins and Wightman (1980), Smoski and Trahiotis (1986), and Spencer et al. (2016). Just like for the pure tones, we can see that overall the framework yields good predictions of the behavioral data. Additionally, the computational modelling predictions suggest that ITD thresholds are the lowest somewhere between 500 and 1000 Hz of the center frequency. Figure 10(A) shows the corresponding dispersion plot. Given the low number of participants in each study, we pooled all data together.

HI listeners
One of the most attractive properties of the Zilany et al. (2014) and Zilany et al. (2009) model is that it is capable of modelling IHC and OHC damage. Physiological studies have revealed that IHC deterioration causes elevation of the AN fiber tuning threshold curves (Liberman and Dodds, 1984), while OHC deterioration additionally causes broadening. Furthermore, OHC damage has also been associated with the reduction in two-tone of AN responses and reduction in the compression of the basilar membrane responses (Liberman, 1984;Liberman and Dodds, 1984;Miller et al., 1997;Robles and Ruggero, 2001;Salvi et al., 1982). The framework incorporates these effects of OHC and IHC impairment using two scaling constants: C OHC and C IHC (Bruce et al., 2003;Zilany and Bruce, 2006), respectively.
C OHC is introduced at the output of the control path. A value of C OHC ¼ 1 represents normal OHC function. This allows for normal behavior of the non-linear basilar membrane filter, yielding narrow and low tuning thresholds curves and output compression. The closer C OHC gets to 0, the larger the impairment of the OHCs. This modifies the behavior of the basilar membrane filter in two ways: at low sound levels, it increases the tuning curve bandwidth and In this case, the ITD threshold was computed as the ITD value corresponding to d 0 of 1.5 (equivalent to 79% of the psychometric function, as used by Brughera et al., 2013), which resulted in a value of 37.8 ls.
FIG. 6. (Color online) Validation of the modelling framework performance for predicting ITD thresholds for pure tones in NH listeners using the behavioral data reported by Brughera et al. (2013) as a reference. (A) Original data and its median compared with model predictions. The latter correspond to one simulated NH listener (and thus have no error bars). The latter were shifted 25 Hz along the x axis for clarity. Model predictions corresponding to behavioral data are shown with filled symbols. Additional model predictions are shown with empty symbols. The dotted line represents the highest frequency for human ITD discrimination measured by Brughera et al. (2013). (B) Dispersion plot of predicted vs behavioral ITD thresholds. The purple line considers data for all frequencies. The blue line considers data for frequencies below 1300 Hz. elevates the thresholds; at moderate to high levels, it reduces (or eliminates) the output compression.
C IHC is introduced in the signal path. Analogous to C OHC , a C IHC value of 1 represents normal IHC function, while values closer to 0 represent larger IHC impairment. This is modeled by lowering the slope of the function that relates basilar membrane vibration to IHC potential, which causes elevated threshold tuning curves.
The model calculated the C OHC and C IHC values for each listener using as an input his/her reported PTA. It attributed 2/3 of each threshold shift to OHC impairment and the remaining 1/3 to IHC impairment (this proportion is in line with previous studies regarding hearing loss in cats and estimated OHC/IHC detriment in HI listeners in average, Bruce et al., 2003;Lopez-Poveda and Johannesen, 2012;Plack et al., 2004). Specifically, it calculated the listeners' C OHC / C IHC by comparing their OHC/IHC threshold shift with a set of previously computed C OHC /C IHC values based on the measurements by Shera et al. (2002). A complete description of their computation is given by Bruce et al. (2003) and Zilany and Bruce (2006). Figure 8 shows C OHC /C IHC values as a function of auditory thresholds for different OHC/IHC proportions and for two CFs (500 and 4000 Hz). Figure 9 shows the model's predictions for bandpass noise in HI listeners. It includes data reported by Gabriel et al. (1992), Hawkins and Wightman (1980), Smoski and Trahiotis (1986), and Spencer et al. (2016). Overall, the framework is capable of following the trends of behavioral data. This is true for data from all studies, except for the case of the Gabriel et al. (1992) dataset, where the model is unable to predict an increase of the ITD threshold at 2000 Hz. Additionally, just like in the NH case, the computational modelling predictions suggest that ITD thresholds are the lowest somewhere between 500 and 1000 Hz of the center frequency. It is worth mentioning that, in most cases, the performance of HI listeners shows a large variability, which is reflected by large error bars. Figure 10(B) shows the corresponding dispersion plot. Similarly to the NH case, we pooled all data together.

B. Contribution of OHC/IHC loss to HI
Current literature suggests that although both OHC and IHC impairment are responsible for cochlear hearing loss, OHC damage contributes in a larger proportion to the measured threshold shift on average (Bruce et al., 2003;Moore, 2007;Zilany and Bruce, 2006). However, evidence shows FIG. 7. (Color online) Validation of the framework performance for predicting ITD thresholds for bandpass noise in NH listeners using behavioral data reported in different studies as a reference. Error bars denote 6 1 geometric SD across listeners and are reported where available. Model predictions correspond to one simulated NH listener (and thus have no error bars). The latter were shifted 50 Hz along the x axis for clarity. Model predictions corresponding to behavioral data are shown with filled symbols. Additional model predictions are shown with empty symbols. that this proportion varies broadly individually across listeners, even among those with identical thresholds (Liberman, 1984;Lopez-Poveda and Johannesen, 2012). Furthermore, anatomical studies have shown that the amount of OHCs and IHCs can vary widely across individuals (Mcgill and Schuknecht, 1976;Wright et al., 1987). Thus if we wish to individualize predictions, more listener-specific factors may have to be taken into account. Therefore, we explored the effects of different proportions of hair cell loss on ITD discrimination. Besides the original 2:1 proportion attributed to OHC/IHC (Sec. III A 2), we investigated the cases when the proportion was 1:0, 1:2, and 0:1. We simulated two HI listeners with a mild and a moderate impairment with flat threshold shifts of 25 and 40 dB hearing level (HL), respectively, across all frequencies (plus a NH listener as a reference). As a stimulus, we used bandpass noise with characteristics defined in the last row of Table I.
These results are shown in Fig. 11. On one hand, we can see that in the 25 dB HL case (panel A) there is a slight increase of the thresholds compared to the NH case. However, the curves show little spread, making it hard to see the effect of different OHC/IHC proportions. On the other hand, we can see that in the 40 dB HL case (panel B) not only are the thresholds higher, but the curves have a larger spread. These curves suggest that there is an effect of the different OHC/IHC proportions for frequencies above 500 Hz. Furthermore, it looks like the impairment of OHC has a larger impact on ITD detection: when the auditory threshold shift is completely attributed to the OHCs, the ITD thresholds are the worst (i.e., highest). FIG. 9. (Color online) Validation of the framework performance for predicting ITD thresholds for bandpass noise in HI listeners using behavioral data reported in different studies as a reference. Error bars denote 61 geometric SD across listeners and are reported where available. Model predictions correspond to the simulated HI listeners (i.e., C OHC < 1, C IHC < 1). The latter were shifted 100 Hz along the x axis for clarity. Model predictions corresponding to behavioral data are shown with filled symbols. Additional model predictions are shown with empty symbols.

IV. DISCUSSION
In this work, we introduced a computational framework that uses a physiologically inspired model of the AN as a front end and a neurometric decision device as a back end to predict ITD thresholds in NH and HI listeners. We validated its performance by comparing the predicted ITD thresholds for narrow-band stimuli against behavioral data reported in literature. Additionally, we investigated the effect of changing the proportion of impairment attributed to different OHC/IHC combinations.

A. Framework structure
The presented computational framework has several advantages. First, the model of the auditory periphery used as a front end (Zilany et al., 2014;Zilany et al., 2009, Sec. II A) is inspired by the anatomy and physiology of the AN and has been validated with a wide range of physiological data. Furthermore, being able to model the individual elements of the auditory pathway allowed us to study them individually. In this particular case, it allowed us to incorporate sensorineural hearing loss as impairment of either the OHCs or the IHCs, something which would not be possible if the representation of the AN was coarser.
Regarding the coincidence detection stage, the SCC Louage et al., 2004) basically consists of tallying spike intervals. This process can be thought as a natural display of the information arriving from the auditory periphery (Joris et al., 2005). In other words, it extracts information by comparing temporal coding across contralateral AN fibres within an integration window. Our results show that such a relatively simple metric is enough to model ITD discrimination. Additionally, the SCC helped us transform a discrete representation of neural activity (i.e., spike trains) into a continuous representation (i.e., the SCC curve itself), allowing us to further manipulate said representation accordingly (centrality weighting) in order to compute an ITD value (peak selection).
Finally, the implemented neurometric decision device used the distribution of these computations to predict an ITD threshold value. It is worth emphasizing that the latter was given in time units (ls in this case), which made the comparison between the model predictions and the behavioral data intuitive and straightforward. Using the d 0 metric allowed us to incorporate information about the framework's ITD computations and their dispersion (i.e., variability). Furthermore, the framework's modular design allowed the use of different building blocks at any of its different stages. For instance, the current auditory periphery block could be substituted by any other model that takes a pressure wave as an input and yields spike trains as an output. This is not restricted to models of purely the AN. Models of electrical stimulation could be used to investigate binaural hearing in bilateral cochlear implant users (e.g., Prokopiou et al., 2017).
During the framework's development, we were careful to comply with the suggestions proposed by Colburn and Durlach (1978).
(1) Relate the model's assumptions and parameters with known physiological data. Our framework included a front end that mimics the response of the AN based on physiological information.
(2) Avoid overfitting of the model's predictions due to excessive parametrization. The framework's parameters were not tuned in a per-case basis. They were defined a priori and were the same for all conditions of all simulated experiments [except for the decision device criterium (Sec. II C), which was configured according to the simulated behavioral experiment]. Furthermore, we kept the number of parameters to a minimum. (3) Describe quantitatively how the stimuli are processed and how they are affected by (internal) noise. The modular nature of our approach allowed us to examine not only the final output, but also the outputs of the intermediate stages (Sec. II). Additionally, the framework's internal noise (e.g., from the auditory periphery, from the coincidence detection) was taken into account by the decision device to predict the ITD thresholds. (4) Consider perceptual principles when modelling higher, more central portions of the system. The implemented decision device (Sec. II C) models psychophysical aspects of human perception (Green and Swets, 1966). (5) Compare the model predictions with relevant data. We validated the framework with appropriate behavioral datasets previously reported in literature (Sec. III A).

B. Framework validation
The proposed framework was able to predict the trends of behavioral data. In the NH case, we predicted ITD thresholds for pure tones and bandpass noise. For the former, we used the dataset from Brughera et al. (2013) as a reference. Figure 6(A) shows that although the framework tended to overestimate the FIG. 11. (Color online) Effect of attributing different auditory threshold shift proportions to OHC/IHC impairment for bandpass noise. Its corresponding parameters are given in the last row of Table I. ITD thresholds, it was still able to capture the data trend: low thresholds in the mid-frequency range (700-1000 Hz) and higher thresholds in the low (250-500 Hz) and high (1200-1400 Hz) frequency ranges (with a considerable nonmonotonicity at $1300 Hz). However, the model does not increase its ITD threshold output as fast as the behavioral data. Additional computational simulations show that the model is only unable to "perform the task" (i.e., output ITD threshold values that were too large) above 1500 Hz. We hypothesized that this was because the auditory periphery stage is unable to capture the loss of phase locking for frequencies above 1300 Hz fast enough. We confirmed this after computing the vector strength (Johnson, 1980) for the spike trains across different frequencies. We obtained vector strength values that slowly decreased monotonically with increasing frequency with an abrupt decrease only at $1500 Hz. Therefore, the model finds its performance limit at higher frequencies than we would expect.
In the case of bandpass noise, we used different datasets (Fig. 7). The predictions of the datasets from Smoski and Trahiotis (1986) and Spencer et al. (2016) were very close to the reported mean values. The predictions of the datasets from Hawkins and Wightman (1980) and Gabriel et al. (1992) 2 were also able to follow the data trend, although thresholds were overestimated. The model overestimation of the NH thresholds was caused by the SCC resolution. We chose a bin size of 20 ls (smaller values yielded noisy curves leading to incorrect predictions and larger values were too far from human performance). Therefore, the distributions of short experimental ITDs (e.g., 10 or 20 ls) overlapped largely with the distributions of the reference condition (ITD ¼ 0 ls), yielding low d 0 values. After analysis of different neurometric curves, we found that the points corresponding to these short ITDs were responsible for decreasing the slope of the sigmoid fit of the neurometric function, causing the decision device to output higher ITD thresholds. Additionally, the modelling predictions revealed an underlying function that suggests that the best performance (i.e., lowest thresholds) is achieved when the bandpass noise center frequency is between 500 and 1000 Hz. However, this would need to be confirmed with further behavioral measurements.
It is worth mentioning that the model is not able to fully account for the performance detriment between 1000 and 2000 Hz of the Gabriel et al. (1992) data (which is the only study that includes behavioral data for NH listeners at that frequency). The model shows increased thresholds between these two frequencies, but this increase is not as large as in the behavioral case. We attribute this to the model still being sensitive to the temporal fine structure of the stimulus at 2000 Hz. We investigated this by computing the difcor (Louage et al., 2004). The peak of the difcor reflects the temporal fine structure coding strength of the spike trains (and thus of the model). First, we computed the SAC for the left ear spike trains. Then, we computed the SCC between the left and the right ear spike trains. However, the latter corresponded to a polarity-inverted version of the stimulus. Finally, we subtracted them bin by bin. We found that at 1000 Hz, there was strong temporal fine structure coding, with a peak of 3.44. This value decreased only to 2.69 at 2000 Hz. A larger loss of synchrony would result in larger (and thus more accurate) predicted thresholds, as it is in the case at 4000 Hz (with a peak difcor value close to 0).
In the HI case, we predicted ITD thresholds for bandpass noise from the same datasets as in the NH case (Gabriel et al., 1992;Hawkins and Wightman, 1980;Smoski and Trahiotis, 1986;Spencer et al., 2016). Figure 9 shows that the framework was able to predict the behavioral trend at a group level. Low center frequency values yielded low ITD thresholds, which the model was able to predict accurately. Increasing center frequency values yielded higher ITD thresholds. The framework was able to capture this trend, although its predictions tended to be lower than the actual behavioral data (in contrast to the NH case). However, these results have to be handled with care, since we cannot affirm that group (i.e., mean) results can be translated to each individual participant (Akeroyd, 2014). High intersubject variability is a common issue in most HI studies, including the ones presented here. This variability could be attributed to a variety of reasons, being low correlations between PTA and listeners' performance in ITD threshold detection tasks one that is commonly suggested (Moore, 2007). We used PTA information to model OHC/IHC impairment (i.e., to compute C OHC and C IHC values, Sec. III A 2). We believe this is a good first step towards incorporating HI in our framework. Additionally, just like in the NH case, the modelling predictions suggest that lowest thresholds are achieved when the bandpass noise center frequency is between 500 and 1000 Hz. Likewise, this would need to be confirmed with further behavioral measurements (similar to those performed by Gabriel et al., 1992).

C. Contribution of OHC/IHC loss to HI
In the case of the simulated listener with mild impairment [25 dB HL, Fig. 11(A)], we saw that the thresholds were not so different from those of a NH listener. This was expected since thresholds of 25 dB HL are just below the conventional definition of NH of 20 dB HL across all frequencies. Additionally, the stimulus level was well above said threshold.
Of more interest is the case of the simulated listener with moderate impairment [40 dB HL, Fig. 11(B)]. These results hint that the impairment of OHCs has a larger impact on the thresholds than impairment of the IHCs. OHC impairment is modelled by the Zilany et al. (2014) and Zilany et al. (2009) front end as loss of compression, broader tuning, and elevated thresholds (i.e., decreased gain) of the control path (Zilany and Bruce, 2006). These changes in the input-output function as well as in the bandwidth are responsible for the detriment of the temporal coding of the neural fibers. Furthermore, the compression parameters of the model are frequency-dependent (Zhang et al., 2001). The filters' responses are almost linear for frequencies below 500 Hz, while having compressing nonlinearities for frequencies above 500 Hz. This explains why we do not see an effect of different OHC/IHC proportions at lower frequencies. Lopez-Poveda and Johannesen (2012) showed that the OHC/IHC dysfunction contribution to hearing impairment can vary largely across cases, even in listeners with comparable audiometric losses. These results together with the presented model simulations could partially explain the variability observed in the performance of HI listeners in ITD discrimination tasks.

D. Comparison with existing models
Ideally, we would like to quantitatively compare our framework's performance with other similar existing models. Unfortunately, doing such a systematic analysis is not a trivial task (e.g., Ashida et al., 2017;Saremi et al., 2016) and would require a more homogeneous benchmark (Dietz et al., 2018), which is not available for all models. Therefore, further discussion uses a qualitative approach.
First of all, it is worth mentioning that the idea of including physiologically inspired concepts as components of larger models is not new. For example, Patterson et al. (1995) used the IHC model proposed by Meddis (1988) (which simulates neurotransmitter flow across reservoirs) to convert the basilar membrane motion (coming from either a gammatone or a transmission line filter) into a neural activity pattern. The Zilany et al. (2014) and Zilany et al. (2009) model has been used as a front end for studying the effect of sensorineural hearing loss on speech intelligibility (Heinz, 2015) and on sound localization in the median plane (Baumgartner et al., 2016). Moreover, very recent work has used that same front end to study ILD perception. The model proposed by Brown and Tollin (2016) subtracted simulated left and right ear AN spike trains within a running temporal window. This simple model was good enough to validate their physiological and psychophysical observations of ILD sensitivity and robustness. Laback et al. (2017) developed a more elaborate framework. They used the simulated spike trains as an input to an interaural comparison stage where they evaluated the difference in mean discharge rates between left and right ear AN inputs and the variability of these rates over different stimuli presentations. Their model was able to account for a variety of published ILD perception data as well as their own.
Nonetheless, few models have used a physiologically inspired architecture to study ITD discrimination. To start with, in our previous work (Prokopiou et al., 2017) we investigated binaural temporal discrimination of NH and bilateral cochlear implant users with different stimuli using physiologically based front ends. However, we did not account for the large proportion of neurons sensitive to smaller ITDs (incorporated in the presented framework by using a centrality weighting function, Sec. II B 2). The ITD threshold estimation was done using a novel binary classifier characterisation device. Although the latter yielded good data trend predictions, its approach was more analytical (rather than directly modelling psychophysical procedures). Additionally, the model predictions were given in arbitrary model units (rather than in proper time units [ls], like the current framework), which made the comparison with behavioral data more difficult. Furthermore, we did not investigate the (unaided) HI case. Hancock and Delgutte (2004) used a neural-pooling model to simulate ITD-sensitive IC neurons to broadband noise. They used gammatone filters to model the processing of the auditory periphery all the way to the IC. Then, they generated a population model which they parametrized with their own physiological data collected in cats. At a singleneuron level, they found very good agreement between physiological data and model predictions of rate-ITD curves. At a population level, they were able to predict ITD thresholds for pure tones and broadband noise. However, using a filterbank to simulate the auditory periphery does not include important details of its physiology and limits is capabilities in further expanding it to include HI at this level. Additionally, they included a so-called "efficiency parameter" (e) for the ITD thresholds predictions. This parameter was empirically adjusted to account for several factors such as stimulus variability, inefficient pooling process, or deficient decision making, but had no physiological relevance. They acknowledged that this approach gave the absolute ITD thresholds predicted values little significance (the comparison with behavioral data was not so straightforward) and thus focused in the trends of ITD threshold changes. Brughera et al. (2013) minutely investigated ITD threshold discrimination in NH listeners using pure tones across different frequencies, with special focus in the range between 1200 and 1400 Hz. Additionally, they predicted this data set using two different variations of a Hodgkin-Huxleybased MSO neuron model: a lateralization centroid model and a rate-difference model. The former was successful in predicting ITD thresholds at high frequencies, while the latter was successful in predicting ITD thresholds at low and middle frequencies. This rate-difference variation could also predict high frequency thresholds, but only when the model was tuned ad hoc. They also proposed using a combination of both types of models in order to predict thresholds across all frequencies. However, neither of these two scenarios (stimulus-specific parameter tuning and/or processing) are desired, since they could very well lead to overfitting.

E. Framework limitations
Finally, even though the presented framework was successful in predicting ITD threshold perception trends in NH and HI listeners, there are some shortcomings worth pointing out.
We wanted to investigate the contributions of physiological pre-processing stages shaping the input to the binaural stage. Our results show that using a physiologically based model of the auditory periphery combined with a coincidence detection stage (such as the SCC) together with a neurometric decision device can be sufficient to model simple binaural processes. In other words, combining information from the auditory periphery with a simple metric of binaural coincidence is enough to model ITD discrimination in NH and HI listeners. Other studies have used a similar approach. For instance, Franken et al. (2014) input cat AN and trapezoid body neural recordings to a "bare bones" model of an MSO neuron. This neuron generated an output spike when its inputs occurred close enough in time. They concluded that the fundamental operation in the mammalian binaural circuit is the coincidence counting of (single) binaural input spikes. This type of simple binaural schemes (either based on metrics or on models) go in line with recent physiological studies that suggest that the ITD sensitivity depends on the exact timing of the excitatory inputs to MSO neurons (van der Heijden et al., 2013). Including a more elaborate MSO stage [such as the models proposed by Brughera et al. (2013), Colburn et al. (2009), Colburn et al. (1990, and Takanen et al. (2014)] could still be beneficial. For example, the MSO has been associated with processing of temporal fine structure, while the lateral superior olive has been associated with processing of the envelope (Remme et al., 2014). Including a two-channel structure (similar to the one used by Dietz et al., 2009) could allow us to further investigate the effect of HI in temporal fine structure and envelope binaural perception. It could even help reduce the discrepancies between the model predictions and the behavioral data of high-frequency pure tones, since it has been suggested that synchrony loss takes place there (Brughera et al., 2013). Additionally, modelling the MSO neural activity in more detail could help us discern its contribution to auditory brain stem responses (Ashida et al., 2015). Finally, an MSO stage would be crucial to understand better the processing of (sensorineurally impaired) peripheral input along the binaural auditory pathway.
In the centrality weighting block (Sec. II B 2) of the coincidence detection stage, we used the function proposed by Stern and Shear (1996). This function was obtained using human psychoacoustical data. It would be interesting to explore the effect of centrality-weighting functions obtained using physiological data (e.g., Hancock and Delgutte, 2004;McAlpine et al., 2001). Although these functions were generated using animal data, their use could help in investigating if similar neural distributions exist in humans and to provide a better insight on the underlying assumptions of the psychoacoustic functions.
For the sake of simplicity, we tuned the AN fibers with the same CF as the stimulus frequency (for pure tones) or center frequency (for bandpass noise). Tuning AN fibers to different CFs could allow investigating the effect of crossfrequency integration for ITD perception in NH and HI listeners for broadband stimuli. It could also allow to explore the contribution of off-frequency filters to ITD sensitivity of modulated stimuli (Bernstein and Trahiotis, 2002).
Additionally, we considered ITDs to be the only binaural cue present. However, this is unrealistic, since sounds in a real-world scenario contain ITDs together with ILDs. In order to be able to model this, we would need an additional module that could integrate binaural information across cues. The latter is not a trivial task, since the questions of at what levels of the auditory pathway and to what extent are time and level cues combined are still topics of ongoing research (Brown and Tollin, 2016;Ellinger et al., 2017;Johnson and Hautus, 2010;Palom€ aki et al., 2005;Phillips and Hall, 2005;Takanen et al., 2014). Furthermore, the framework fully attributed HI to hair cell damage (Sec. III A 2). However, there are additional factors that may affect binaural perception, such as age, cognition, and loss of AN-hair cell synapses (Gallun et al., 2014;Kujawa and Liberman, 2015).
Last, it is worth mentioning that several assumptions on which the framework's blocks rely on have been validated using animal data. Their translation to human auditory perception still needs to be confirmed by further physiological and psychoacoustical studies (e.g., Salminen et al., 2018).

V. CONCLUSIONS
We presented a physiologically based modelling framework capable of predicting ITD thresholds for NH and HI listeners. It uses the model of the AN proposed by Zilany et al. (2014) and Zilany et al. (2009) as a front end and a coincidence detection stage based on the SCC Louage et al., 2004) together with a neurometric decision device as a back end. The framework was validated by comparing its predictions with behavioral data from literature, showing good agreement between them. These results show that the presented framework is capable of modelling ITD discrimination of NH and HI listeners at a group level. Additionally, we used it to study the contribution of OHC/ IHC loss of HI listeners. Model results hint that OHC impairment has a larger impact on ITD discrimination thresholds than damage to the IHCs for frequencies >500 Hz. was unable to predict the frequency region of best performance. Therefore, computing the maximum peak was a more suitable choice for our application.