Brain machine interfaces: Neuron Processor Interface
Lieuwe B. Leene, Yan Liu, Timothy G. Constandinou
Department of Electrical and Electronic Engineering, Imperial College London, SW7 2BT, UK
Centre for Bio-Inspired Technology, Institute of Biomedical Engineering, Imperial College London, SW7 2AZ, UK
A core aspect of emerging neuroscience is performing real-time data analysis at a massive scale. However, when we observe its manifestation in state-of-the-art neural interfaces, we find that the hardware is limited to specific methods that can be short-sighted. This chapter aims to direct our attention to a different point of view on how these sensor systems can be structured. In particular we are guided by the concept of an implant that is capable of performing software-defined instrumentation. The focus lies with enabling real-time, in-vivo testing of a more diverse set of signal characterization methods. More importantly, we will demonstrate that this can be made feasible for large-scale distributed systems.
This particular approach is motivated by a number of factors that aim to increase performance and enable research opportunities. The first is that many aspects of an implant’s signal quality cannot be predicted beforehand. As a result, implementing a specific algorithm for specific signal characteristics may lead to failure or to an overly conservative design, because the environment can potentially be excessively noisy. Introducing the capacity to dynamically execute different processing methods on neural data allows the implanted system to use either LFP or EAP activity in real time. This may be a significant element in improving the success of chronic BMI implants. Moreover, the prolific development of characterization methods used for decoding neural data inhibits a general consensus on DSP techniques. This prevents a single method and corresponding architecture from being applicable in most scenarios. The second factor is that this approach conceptually enables the development of real-time, resource-constrained algorithms, virtues that are often neglected when working with data sets. Currently most BMI development platforms have limited capabilities to allow algorithms to use external or multi-modal features to inform local operation while simultaneously providing recordings from hundreds of electrodes. This construction may be a key factor in allowing high-level algorithms to directly manipulate machine learning parameters local to each implant. This hierarchical organization should improve the efficiency of distributed BMIs for decoding information. In contrast, we question the feasibility of scaling the current supervised methods, which require fine measurements of each electrode’s recording to approach an optimal decoding strategy. Typically the computational cost of this approach remains exhaustive when reconfiguring sensor parameters because it uses a centralized unit that recalibrates all recording channels in an elaborate fashion.
This chapter is organized as follows: Section 31 motivates localized processing for increased efficiency and estimates to what extent we can perform on-chip processing. This is followed by Section 32, where typical methods used for neural signal analysis are introduced, and the respective hardware complexity is demonstrated in Section 33. This leads into the proposed distributed processing architecture in Section 38, where the design is discussed with respect to the implementation. Section 41 demonstrates the realization of this platform. Finally, Section 42 draws conclusions with respect to the digital approach to neural instrumentation.
31 Processing at the interface
Ideally a neural interface device is tasked with recording from a large ensemble of electrodes and transmitting information with the lowest bandwidth, because the harvested power is a scarce resource for implants inside the body. However, over the course of an implant’s lifetime most signal characteristics change dynamically, which implies that there should be an involved learning process that similarly adapts to these changes. This can also mean that the output bandwidth is constrained by the total amount of mutual information that can be retained within the device. Such a device will predict the expected recording from one time interval to the next and differentiate any new information that needs to be transmitted. Hence we should be convinced that the processing capacity or complexity of a closed or memory-limited system should be reflected in its fundamental ability to store information 1.
In order to capture some high level trends with respect to processing requirements, let us normalise memory capacity in terms of state variables, a measure that is independent of modality. This is particularly useful because the number of state variables in a dynamic process is a good indicator of complexity, whether it is a digital classifier or an analogue filter. Here we will focus exclusively on processing by assuming that the signal being operated on is idealized with respect to amplitude and its representation. This extends our analysis in Section \ref{ch:T1_model} by elaborating specifically on the comparison of digital and analogue resource allocation associated with processing.
$$ R_{A} = \underbrace{\frac{2\pi BW kT SNR^2 U_T}{V_{DD} L}}_{power} \cdot \underbrace{\left( \frac{A_{min}}{L} +\frac{kT SNR^2}{L V_{DD}^2 C_{dens}} \right)}_{Area} $$
If we represent the resource required as the power area product for a state variable, then in the analogue domain it is given by Equation 32. Here \(BW\), \(V_{DD}\), \(C_{dens}\), \(A_{min}\) reflect the signal bandwidth, supply voltage, capacitor density, and typical transconductance area overhead for a particular technology respectively. \(L\) is a normalized feature size that allows us to evaluate parameters for a particular technology and extrapolate them based on constant field scaling factors.
$$ R_{D} = \underbrace{ 2BW \alpha \log_2(SNR) C_{gate} V_{DD}^2 L^2 }_{power} \cdot \underbrace{ \alpha \log_2(SNR) A_{gate} L^2}_{Area} $$
Similarly Equation 33 represents the power area product for a digital state variable. \(C_{gate}\), \(A_{gate}\), \(\alpha\) parametrise the typical gate capacitance, gate area, and overhead for each register respectively. Generally the dependencies of both \(R_A\) and \(R_D\) are well understood and guide the maximization of system efficiency in an abstract sense 2.
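The two power area products can be evaluated side by side with a few lines of Python. The following is only a sketch that transcribes Equations 32 and 33 as written. It assumes that \(U_T\) denotes the thermal voltage \(kT/q\) and a temperature of 300 K, neither of which is stated in the text, and the function names and all numerical parameter values are hypothetical placeholders rather than figures for any real process.

```python
import math

K_BOLTZMANN = 1.380649e-23        # J/K
Q_ELECTRON = 1.602176634e-19      # C

def analogue_resource(bw, snr, v_dd, L, a_min, c_dens, temp=300.0):
    """Power-area product R_A of one analogue state variable (Equation 32)."""
    kT = K_BOLTZMANN * temp
    u_t = kT / Q_ELECTRON         # assumption: U_T is the thermal voltage kT/q
    power = 2 * math.pi * bw * kT * snr**2 * u_t / (v_dd * L)
    area = a_min / L + kT * snr**2 / (L * v_dd**2 * c_dens)
    return power * area

def digital_resource(bw, snr, v_dd, L, c_gate, a_gate, alpha):
    """Power-area product R_D of one digital state variable (Equation 33)."""
    bits = math.log2(snr)         # register width implied by the SNR
    power = 2 * bw * alpha * bits * c_gate * v_dd**2 * L**2
    area = alpha * bits * a_gate * L**2
    return power * area

# Placeholder parameters, chosen only to exercise the functions.
params_a = dict(bw=5e3, snr=256, v_dd=1.8, a_min=1e-10, c_dens=1e-3)
params_d = dict(bw=5e3, snr=256, v_dd=1.8, c_gate=1e-15, a_gate=1e-11, alpha=10)

for L in (1.0, 0.5, 0.25):        # normalized feature size
    print(L, analogue_resource(L=L, **params_a), digital_resource(L=L, **params_d))
```

Note that the supply voltage is passed independently of \(L\) here, so a caller who wants to follow constant field scaling would have to scale `v_dd` together with `L`.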

Figure 49: Impact of technology on \(R_A\) analogue (green) and \(R_D\) digital (blue) processing resource requirements extrapolated from a \(180 nm\) CMOS technology under constant field scaling.
For neural instrumentation, however, both power and area requirements must be highly constrained in order to realize a device that can accommodate a large number of recording channels and remain implantable. Figure 49 shows the resource requirements for processing in the analogue and digital domain with respect to signal fidelity and CMOS technology. Either approach can present an advantage over the other under specific conditions. Digital systems appear favourable beyond \(65 nm\) CMOS, whereas analogue will do better at lower SNR conditions given a technology with a larger feature size. The discussion in Section \ref{ch:T1_model} suggested that the analogue preconditioning requires a resource allocation of $10^{-15} Wm^2$ with a weak dependence on technology. Moreover, if quantization is not considered then power can be entirely determined by the noise specification and the area requirements are dependent on the gain configuration. Comparing this figure with the estimate of \(R_D\) indicates that we should be able to integrate a considerable amount of processing capability before the DSP uses a comparable amount of resources. This is important because improving on-chip processing capacity ideally results in less supervision and a lower wireless communication bandwidth.
While we may expect other sources of overhead and extra power dissipation components, we should instead take a moment to consider the implications of this result. This is particularly relevant to the claim that electronic sensing of brain activity on larger scales is not viable due to the excessive data rates derived from the principle entropy relations and the associated communication bandwidth 3. Clearly any degree of on-chip processing undermines this limitation because it enables us to achieve data rates far lower than that of the electrical signals by finding a more appropriate basis for encoding information. On the other hand it does raise an important point regarding the relationship between the generated output data rate and the recorded signal to noise ratio. In some sense we are simply faced with the challenge of best consolidating the recorded information towards high level indicators for specific objective functions. This allows us to approach the data rates extracted from current BMI studies, which are negligible in comparison to the Nyquist rates. In this light we argue that electronic methods for recording activity are the closest to realizing a viable neuroprosthetic solution in the near future when compared with optical, magnetic and less invasive BMI architectures.
Now we can make the assertion that there are two approaches to solving the system level challenge of integrating wireless neural instrumentation systems. The first approach would be a mixed signal topology that extensively uses analogue processing such that technology has a weak impact on improving efficiency. Instead the critical component lies with the effectiveness of analogue dimensionality reduction. In such a case we need to adopt a well established algorithm that can accommodate analogue variability and still deliver exceptional signal characterization. The second approach is to rely on digital methods that deliver robust and reconfigurable compute resources that scale well with technology. This should guarantee the capacity for a variety of fully adaptive algorithms capable of extracting multi-modal characteristics from recordings, which could be much more valuable for experimental neuroscience at this point in time. Unfortunately not all forms of algorithms can use the low power characteristics of processing in the analogue domain. Moreover they are typically limited by underlying assumptions regarding noisy perturbations. When we introduce different contexts of operation, reconfiguring an analogue structure cannot be done as arbitrarily as reconfiguring a digital one. For this reason we will adopt the digital approach in order to leverage robust reconfigurable capabilities and reconsider the analogue approach in Section 48.
The significance here is that these trends allow us to roughly estimate the complexity of algorithms for different technologies if their resource requirements are made to be equivalent to that of the instrumentation circuits. We show in Figure 50 that using a $0.18 \mu m$ CMOS process should give way to approximately 100 state variables, or equivalently about 100 operations per sample taken. In fact, looking at image processors that similarly rely extensively on data intensive post processing, we can see the same dependency on technology scaling as we have predicted for various levels of digital performance at different technology nodes. It is important to note that the normalized efficiency evaluated here is in fact independent of signal bandwidth and only depends on the signal to noise ratio and its relation to the supply voltage.
Figure 50: Analytic number of digital operations available with respect to different technologies (red) with references to the normalized performance of image processors (blue).
$$ P_{system} = \underbrace{ N_{channel} \cdot \left( P_{Algo} + P_{Transmit} \right) }_{In-Channel} + \underbrace{ P_{Control} + P_{Comms} }_{System-Level} $$
System level design of an embedded processing system for BMIs should be guided entirely by the optimization of compute power efficiency. As shown by Equation 34, we expect two primary components in the system power breakdown. The overall objective should lie with minimizing the channel level power dissipation of the algorithm \(P_{Algo}\) by increasing that of the system level control \(P_{Control}\), such that the component that scales with channel count is reduced. Secondly we should keep in mind that reducing the processed output data alleviates the dissipation of the on-chip communication \(P_{Transmit}\) and the external telemetry power consumption \(P_{Comms}\).
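The trade-off described by Equation 34 can be made concrete with a short calculation. The sketch below uses purely hypothetical power figures and a hypothetical function name. It only illustrates that moving a small amount of per-channel algorithm power into the shared control block pays off once the channel count is large.

```python
def system_power(n_channel, p_algo, p_transmit, p_control, p_comms):
    """Total power following Equation 34 (all arguments in watts)."""
    in_channel = n_channel * (p_algo + p_transmit)   # scales with channel count
    system_level = p_control + p_comms               # paid once per device
    return in_channel + system_level

# Hypothetical numbers: 1000 channels, microwatt-level blocks.
baseline = system_power(1000, p_algo=2e-6, p_transmit=1e-6,
                        p_control=50e-6, p_comms=200e-6)
# Shift 0.5 uW per channel out of the algorithm at the cost of 100 uW of control.
shifted = system_power(1000, p_algo=1.5e-6, p_transmit=1e-6,
                       p_control=150e-6, p_comms=200e-6)
print(baseline, shifted)
```

With these placeholder values the shifted configuration dissipates less in total, because the 100 µW added to the control block is outweighed by the 0.5 µW saved on each of the 1000 channels.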
32 Methods for Neural Signals
A key component to developing this platform is a discussion on the diverse set of signal processing methods performed on neural data and their computational requirements in order to determine our system’s specifications. More importantly we want to judge what is the expected complexity for some of these operators and how many variables are allocated during each process. There are in principle four different categories for the methods that are applied to process neural signals which are listed below. In practice, a single integrated system will utilize a multitude of different techniques to achieve denoising and feature extraction.
Pre-Processing is the filtering and conditioning of each ADC output sample using FIR, IIR, or non-linear filters. Here the objective is simply to de-noise the features or signal components. The resulting signal allows for better detection or more precise evaluation of the signal characteristics and is often closely related to the characteristics of the instrumentation circuits. This output may be considered as the raw signal recording that is used to benchmark any post processing methods or other instrumentation systems.
Detection is associated particularly with capturing the intermittent spike events. There is no quantitative evaluation made with regard to the nature of the detected spike, but any detection may trigger the process that records contiguous samples around the detection event, which are then subsequently characterized. These events are commonly triggered upon simple threshold crossings of the signal or of its integral power over several samples. In some systems the interest lies only with the accurate detection of spike events, which is sufficient to perform closed loop therapeutic treatment or control actuators external to the body.
Data reduction can be interpreted either as representing a spike waveform in terms of defining features or with a reduced basis to allow approximate reconstruction, for example by using spike amplitude or wavelet decomposition. Representing waveforms in terms of their quintessential components allows for more efficient post processing and reduces data rates in the case of wireless telecommunication. In its primitive sense this is simply dimensionality reduction of the recorded data and is usually followed by supervised or unsupervised signal classification.
Classification is the predominant objective for BMIs and is the main difficulty to realize inside an embedded recording system. Such a task in spike based systems primarily performs a generalization of the detected spike shape in terms of the previously detected neurons. This reflects the fact that in most cases multiple neurons can be detected by a single electrode, and by distinguishing these events the integrity of information can be preserved. The objective here lies with having an equivalent spike event output as simple detection to perform actuation but with better fidelity.
It should also be clear that the above operations primarily focus on reducing the recorded signal to its primitive components in terms of spike events at the rate of \(100 b/s\) instead of the \(256 kb/s\) data stream typically generated by the ADC. The processing layer on top of this elementary function will either aim to evaluate neural connectivity or use collections of spike rates to perform inference of high-level dynamics. This application specific processing will not be considered here, primarily because the nature of such a problem is very different from the more generic extraction of information from recordings. As a result that system architecture should revolve specifically around multichannel trained dimensionality reduction. Even where such a task can adjoin what is presented here, it will be outside the scope of this discussion, which targets more generic signal instrumentation.
Figure 51: Estimated resource requirements for different classes of algorithms used for processing neural recordings found in literature.
When we survey the various algorithms found in recent literature and estimate the expected memory and computational requirements, we might observe a distribution like that shown in Figure 51. For fair comparison we have adapted these methods to operate on a window size of 32 samples with three types of detected spike waveforms when applicable, and only accounted for memory allocation that cannot be shared across channels. This should give a good normalized indication of the limiting components for each method, which we could then further optimize in a more specialized manner. Notice that there is a strong correlation between the memory usage and the required number of operations for most embedded systems.
Some of the most efficient spike detection and feature extraction is associated with using temporal characteristics of the spiking waveforms. Common examples include using the time interval between minimum and maximum peaks or the duration of threshold crossing for detected spikes. The defining characteristic here is that alignment buffers are not needed, which leads to using very few operations per sample. Inherently the drawback is the increased sensitivity to noise, which implies a very limited capacity to distinguish and classify different spike shapes unless more filtering is performed. There are other methods that also operate with reduced signal buffers, for instance compressed sensing, where signals are continuously re-projected with a sensing matrix. This requires a few accumulators for each coefficient being extracted, but this is strictly data compression. Ultimately this may not help directly with classification or signal characterization, which must now be performed off-chip before any benefit can be realized.
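To illustrate why temporal features need no alignment buffer, the following sketch extracts the two examples named above on a sample-by-sample basis using only a handful of state variables. The class name, the choice to count every sample above the threshold as the crossing duration, and the toy waveform are all assumptions made for illustration.

```python
class TemporalFeatures:
    """Streaming peak-interval and threshold-duration features (no sample buffer)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.n = 0                       # sample counter since the detection event
        self.v_min = float("inf")
        self.v_max = float("-inf")
        self.n_min = 0                   # sample index of the running minimum
        self.n_max = 0                   # sample index of the running maximum
        self.width = 0                   # number of samples above the threshold

    def update(self, x):
        if x < self.v_min:
            self.v_min, self.n_min = x, self.n
        if x > self.v_max:
            self.v_max, self.n_max = x, self.n
        if x > self.threshold:
            self.width += 1
        self.n += 1

    def features(self):
        """Return (interval between extrema, samples above threshold)."""
        return abs(self.n_max - self.n_min), self.width

tf = TemporalFeatures(threshold=4)
for sample in [0, 2, 8, 5, -1, -6, -3, 0]:   # toy biphasic waveform
    tf.update(sample)
print(tf.features())                          # (3, 2)
```

Each incoming sample costs three comparisons and at most a few assignments, which is in line with the very low operation count attributed to this class of methods.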
Many other classes of algorithms operate on a windowed basis and exploit learned mean spike shapes that are expected in the recording. Here a convolution or distance operator will indicate which class of spike is detected. For terminology, let each convolution of the signal result in a feature that is used for detection and/or classification. The adaptive component of these methods leverages a significant amount of noise shaping and separation, depending on what the objective function entails when the basis for convolution is being determined. The most prevalent approach is convolving with principal components, which simply maximizes the signal variance in the projected space. In contrast to temporal features, which struggle with sample limited denoising, the windowed operators should be more robust towards noise. Instead there is some difficulty in the systematic alignment of the window with the spike. This aspect often motivates increased sampling rates or interpolation in order to perform accurate alignment.
The objective function for determining the convolution kernels can be oriented towards maximising sparsity 4, signal to noise 5, cluster separation 6, or expectation maximization 7, each reflecting different signal modalities and analysis methods. Although the complexity of training can vary extensively, the local operations for classification after adaptation are almost equivalent. This operation consists of \(F\) linear projections using \(W\) samples onto the feature space, where \(F\) and \(W\) are the number of features used and the number of samples in the window respectively. There may be some deviation from this operation if the different confidence intervals for each class are also taken into account. This can be done by evaluating centroid distances in terms of the variance of each respective centroid. In contrast to such training, generalized templates averaged over a number of recording channels may also be used for feature extraction in order to share the memory requirements, at the cost of not achieving maximally separated clusters for each individual channel.
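The common local operation can be sketched as follows: \(F\) inner products over a \(W\)-sample window, followed by a nearest centroid decision in which each distance is scaled by the variance of its centroid. The function names, the use of a single scalar variance per class, and the toy numbers are assumptions for illustration only.

```python
def project(window, kernels):
    """F linear projections of a W-sample window (F*W multiply-accumulates)."""
    return [sum(k * x for k, x in zip(kernel, window)) for kernel in kernels]

def classify(features, centroids, variances):
    """Index of the centroid with the smallest variance-normalized distance."""
    scores = []
    for centroid, var in zip(centroids, variances):
        d2 = sum((f - c) ** 2 for f, c in zip(features, centroid))
        scores.append(d2 / var)          # wide clusters tolerate larger distances
    return scores.index(min(scores))

# Toy example with W = 4 samples and F = 2 features.
window = [1.0, 2.0, 3.0, 4.0]
kernels = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
features = project(window, kernels)      # [1.0, 4.0]
centroids = [[0.0, 0.0], [1.0, 5.0]]

print(classify(features, centroids, variances=[1.0, 1.0]))     # 1
print(classify(features, centroids, variances=[100.0, 0.01]))  # 0
```

The second call shows how accounting for per-class confidence can change the decision: the feature vector lies closer to the second centroid in absolute terms, but the first class is assumed to be far more spread out.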
Naturally it is challenging to objectively judge feature extraction and classification methods. We could always reduce the dimensionality of the search space or simplify the convergence strategies to reduce memory and computational requirements. For this reason the details in Figure 51 should be considered in relative terms due to the generalizations made with regard to system specifications. As many methods in the literature are not performed at the sensor interface, they typically will not take advantage of processing on a sample to sample basis and opt for a batched or sequenced processing methodology.
We claim however that the segregation between methods primarily lies with whether the classification features are based on sample space characteristics or alternatively use windowed convolution operators. The former is the less rigorously justified, as the features relating to amplitude and spike width have weak physiological significance, but it is considerably more efficient than other methods. The characteristic requirement of the latter approach is typically related to the window size, which is indirectly associated with the sampling speed in order to fit the relevant spike shape into the window. One important objective of current decoding research is the introduction of adaptive techniques that iteratively improve classification without supervision, particularly without the excessive memory requirements needed to keep track of long term statistics. To demonstrate why this can be particularly challenging, consider a simplified example of using K-means to directly cluster the sample space adaptively, where we are fortunate to know that the number of classes is three.
Clearly for each detected spike we would need to evaluate the distance between our data and the centroids, which has a memory and complexity requirement of \((1+F) W\) variables and \(2 F W\) operations respectively. Then once the class is determined an additional \(2W\) operations are needed to adjust the centroid with the new data. This may include keeping track of the additional \(F W\) truncation residues that allow using a small enough adjustment weight for convergence when our quantization is limited by an 8 bit system. Now for completeness, assume our window is 32 samples, and we come to the conclusion that for each recording channel we need to actively allocate nearly \(2 K bits\).
Typically this primitive aspect of a generic classification algorithm is intensive enough to warrant not performing it locally, in contrast to the complexity of the aforementioned temporal features. It also raises the challenge of trained dimensionality reduction without supervision, which exclusively relies on evaluating the covariance matrix in order to minimize the correlation of non-signal components with our new basis. The above algorithm may be representative in terms of complexity, with the exception that basis pursuit relies heavily on inner products. There is some relief from the fact that these optimal bases change very slowly, and active adaptation is only needed once every few hours because the electrode recording changes slowly with respect to the neural activity. As will be demonstrated, the feasibility of these more involved methods relies very much on the careful construction and memory allocation of the algorithm with respect to the processing architecture. Operations like k-means and PCA decomposition can be performed to a certain extent if the operations are broken down into incremental procedures.
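The K-means estimate can be checked numerically. The sketch below tallies the variables and operations quoted in the text, with the three classes taking the role of \(F\), and adds a minimal online update step. The function names and the floating point update are illustrative assumptions. In particular the update does not model the 8 bit truncation residues, it only shows the assign-then-adjust structure.

```python
def kmeans_budget(n_classes, window, bits=8):
    """Per-channel cost of adaptive K-means on the raw sample space."""
    variables = (1 + n_classes) * window     # spike buffer plus one centroid per class
    residues = n_classes * window            # truncation residues for slow adaptation
    return {
        "distance_ops": 2 * n_classes * window,
        "update_ops": 2 * window,
        "memory_bits": (variables + residues) * bits,
    }

def kmeans_step(spike, centroids, weight=1 / 16):
    """Assign the spike to its nearest centroid and nudge that centroid towards it."""
    dist = [sum((s - c) ** 2 for s, c in zip(spike, cen)) for cen in centroids]
    k = dist.index(min(dist))
    centroids[k] = [c + weight * (s - c) for s, c in zip(spike, centroids[k])]
    return k

print(kmeans_budget(n_classes=3, window=32))
```

For three classes and a 32 sample window this gives 192 distance operations, 64 update operations, and 1792 bits of memory, which agrees with the figure of nearly \(2 K bits\) per channel.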
33 Resource Constrained Classification
To substantiate our expectations for processing at the sensor interface and to evaluate where the system level requirements should lie, we will consider the implementation of two well known methods that process neural recordings and generate classified events. Ultimately such a system needs to encompass a significant variety of different application requirements. The methods will be applied to the equivalent scenario where the proposed instrumentation front-end from Section 17 is used, such that the digitized signal will have considerable dynamic range but lacks analogue filtering. In particular this implies that the recorded signal is filtered by a first order Butterworth low pass filter in addition to near-DC rejection while being sampled at \(25 kS/s\). We will address the rejection of low frequency aggressors in addition to the computational requirement of typical processing algorithms.
Many of the filter and processing considerations are guided by evaluating accuracy empirically and justified through constrained parametric optimization 8. The particular algorithms implemented here are structured in such a way that they take the underlying hardware into account. On numerous occasions we will employ single bit accumulators as approximations to the IIR equivalent feedback structures in order to improve our effective register depth through feedback. This is a primary advantage of in-channel processing, where we may exhaustively make use of recorded data without concern for the communication of these components.
Empirical validation is demonstrated by using synthetic data sets that are publicly available online. This data is based on characterized extracellular recordings where both background activity and spike morphologies are extracted from a human neocortex and basal ganglia. The synthesized recording was originally used to evaluate the performance of super-paramagnetic clustering with wavelet decomposition at different background noise levels in 9. Synthetic data specifically allows the inference of the ground truth, resulting in unbiased performance indicators. For fair comparison of analogue and digital techniques we additionally include low frequency content from \(1-300 Hz\) at \(10\times\) the largest peak to peak amplitude of the extracellular action potentials found in the recordings.
34 Spike Detection & Filtering
Arguably the most influential aspect of neural detection and classification algorithms is the signal preconditioning for systematic and accurate detection of spike events. The importance lies with the fact that detection behaviour has a significant influence on how the feature space appears when the spike is characterized with various methods. Although amplitude noise can usually be accounted for in terms of filtering, any misalignment in the time domain due to noisy aggressors in the detection operator can up-modulate low frequency components. The tendency to perform detection in the digital domain is entirely related to the instantaneous characteristic of discrete time processing, which is superior to the group delay inherent to analogue implementations. Minimizing this factor will minimize the additional memory needed for capturing any signal before the detection event.
The method proposed here tracks both the mean spike amplitude and background noise levels in order to assert the detection level of spike events. Motivated by using physiological characteristics to specify the underlying operation, the parameter \(k_3\) is introduced to represent the amplitude of the maximum spike waveform that is intended to be detected relative to that of the background activity. In other words, if we are only interested in the closest neurons to the electrode \(k_3\) should be close to 1, whereas if we also want to detect background activity with an amplitude at \(25\%\) of the largest spiking events \(k_3\) should approach \(4\). In actuality this term should also be related to how well our classification can separate noisy detections from actual spike events.
\begin{algorithm}
\DontPrintSemicolon
\KwData{Sample from ADC \(X[n]\)}
\KwResult{Detection events \& spike window \(W_1\)}
\Begin{
	\ShowLn Update \(V_{LFP}\) with $k_1 \cdot (X[n]-V_{LFP})$ \tcp*{Track low frequency content}
	Set \(S[n]\) with $X[n] - V_{LFP} + k_2 \cdot S[n-1]$ \tcp*{IIR bandpass filter}
	Set \(G[n]\) as $\sum S[n]\cdot FIR(2R)$ \tcp*{FIR bandpass filter}
	Set \(ES[n]\) as $S[n] \cdot G[n-R]$ \tcp*{Energy estimate from IIR \& FIR product}
	Update \(V_{noise}\) with $k_1 \cdot (|ES[n]|-V_{noise})$ \tcp*{Estimate variance on energy estimate}
	\uIf(\tcp*[f]{Find threshold crossing or new local max}){$ES[n] > V_{th}$ \textbf{and} $ES[n] > max_{local}$}{
		Update \(V_{th}\) with $k_1 (ES[n-1]+2V_{noise} - k_3 \cdot V_{th})$ \tcp*{Adapt peaks and variance}
		Initiate Spike Alignment \;
		Set \(max_{local}\) to \(ES[n]\) \tcp*{Set local maximum}
		Set \(index\) to \(0\) \tcp*{Initiate data pointer}
	}
	\uElseIf{Currently aligning spike (\(index<16\))}{
		Set \(W_1[index]\) to \(G[n-10]\) \tcp*{Store spike waveform with delayed samples}
		Set \(index\) to \(index+1\) \tcp*{Increment data pointer}
	}
	\uElseIf{Idle state (\(index>31\))}{
		Set \(max_{local}\) to \(0\) \tcp*{Finish classification \& find next local maximum}
	}
	\uElse{
		Accumulate \(index\) with \(1\) \tcp*{Increment data pointer}
		%Perform Classification on \(W_1[index]\)\;
	}
}
\BlankLine
\caption{Spike Detection and Alignment}
\label{aglo:T2_Detection}
\end{algorithm}
The specifics of this operation are reflected in Alg. \ref{aglo:T2_Detection}. Here the terms Update, Set, and Accumulate represent recurrence, instantaneous, and integrated relations respectively. The state variable \(V_{LFP}\) primarily removes low frequency drift that is not associated with individual spiking events, and as a result \(S[n]\) is a bandpass equivalent of our sampled signal \(X[n]\). The signal’s instantaneous energy is represented by \(ES[n]\), which is the product of \(S[n]\) and the delayed derivative computed by the FIR of even order \(2R\) with the coefficients $a_n= -a_{2R-n} = 1-2/R \cdot(n-1)$ for \(n\) from \(1\) to \(R\). The factor \(R\) is associated with the ratio of the sampling interval to the spike polarization interval, equivalently $R=f_{nyquist} / 5 kHz$. At the maximum of \(ES[n]\) the operation on line 5 essentially measures the product of the maximum spike intensity with the maximum derivative that precedes it by \(R\) samples. This method primarily depends on the fact that spike detection looks for highly correlated narrow band energy, which rejects a substantial amount of white noise. Moreover the operator compresses uncorrelated components in amplitude, as it exhibits a square dependency in terms of $ES[n] \propto S[n]^2$, making detection less sensitive to variation in the threshold. The fact that the operator is narrow band limits the detection of slower spike waveforms that do not contain large derivative components, but on the other hand this guarantees more systematic alignment. In this case alignment is done simply with respect to where the peak value of \(ES[n]\) is detected.
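For readers who prefer executable notation, the detection loop of Alg. \ref{aglo:T2_Detection} can be sketched in Python as below. This is a floating point illustration rather than the hardware implementation. The class name, the initial threshold, the default \(k_1 = 1/16\), and the five tap antisymmetric FIR used in the example are assumptions, since the text does not fix these values, and the FIR taps are passed in by the caller rather than derived from the coefficient formula.

```python
from collections import deque

class SpikeDetector:
    """Sample-by-sample sketch of the spike detection and alignment loop."""

    def __init__(self, fir, k1=1 / 16, k2=0.5, k3=3.0, v_th=1.0, delay=10):
        self.fir = list(fir)                  # hypothetical taps, length 2R + 1
        self.R = (len(self.fir) - 1) // 2
        self.k1, self.k2, self.k3 = k1, k2, k3
        self.delay = delay                    # samples retained before the event
        self.v_lfp = 0.0
        self.v_noise = 0.0
        self.v_th = v_th                      # assumed non-zero starting threshold
        self.s_hist = deque([0.0] * len(self.fir), maxlen=len(self.fir))
        depth = max(self.R, delay) + 1
        self.g_hist = deque([0.0] * depth, maxlen=depth)
        self.es_prev = 0.0
        self.max_local = 0.0
        self.index = 32                       # start in the idle state
        self.window = [0.0] * 16              # spike window W_1

    def step(self, x):
        """Process one ADC sample; return True when a detection is raised."""
        self.v_lfp += self.k1 * (x - self.v_lfp)              # Update V_LFP
        s = x - self.v_lfp + self.k2 * self.s_hist[-1]         # Set S[n]
        self.s_hist.append(s)
        # FIR over the most recent samples, newest sample first.
        g = sum(a * v for a, v in zip(self.fir, reversed(self.s_hist)))
        self.g_hist.append(g)                                  # Set G[n]
        es = s * self.g_hist[-1 - self.R]                      # ES[n] = S[n] * G[n-R]
        self.v_noise += self.k1 * (abs(es) - self.v_noise)     # Update V_noise
        detected = False
        if es > self.v_th and es > self.max_local:
            self.v_th += self.k1 * (self.es_prev + 2 * self.v_noise
                                    - self.k3 * self.v_th)
            self.max_local = es
            self.index = 0                    # restart alignment at this peak
            detected = True
        elif self.index < 16:                 # aligning: store delayed samples
            self.window[self.index] = self.g_hist[-1 - self.delay]
            self.index += 1
        elif self.index > 31:                 # idle: look for the next maximum
            self.max_local = 0.0
        else:                                 # classification would run here
            self.index += 1
        self.es_prev = es
        return detected

det = SpikeDetector(fir=[1.0, 0.5, 0.0, -0.5, -1.0])
pulse = [0.0] * 50 + [25.0, 50.0, 75.0, 100.0, 75.0, 50.0, 25.0] + [0.0] * 50
events = [det.step(x) for x in pulse]
print(events.index(True))
```

Because a detection is re-armed whenever \(ES[n]\) exceeds the current local maximum, several consecutive events can be raised while the energy estimate is still rising, and the window capture restarts from the last of them.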
Figure 52: Extracted frequency characteristics of digital filter used in Algorithm \ref{aglo:T2_Detection}.
The overall filtering characteristic of \(S[n]\) and \(G[n]\) is shown in Figure 52. The IIR bandpass is a result of \(k_2\) being \(0.5\) such that both filters suppress components around the Nyquist frequency. The group delay should be equivalent to that of a single high pass pole at $250 Hz$, but the FIR assists in further suppressing high and low frequency components. We should not expect a significant contribution from group delay induced distortion, as the features of interest will predominantly have $1 kHz - 5 kHz$ components. Besides, \(R\) and \(k_1\) can always be adjusted to reposition the high-pass poles closer to DC. Note that \(V_{LFP}\) will represent the \(DC-250 Hz\) signal components that can be used to infer characteristics about the background activity.
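The relation between \(k_1\) and the position of the high-pass pole can be estimated from the recurrence used for \(V_{LFP}\), which places a single pole at \(z = 1 - k_1\). The sketch below maps that pole to an equivalent continuous time corner frequency. The value \(k_1 = 1/16\) is an assumption, chosen because it is a power of two that lands close to the quoted $250 Hz$ at \(25 kS/s\), and the function name is hypothetical.

```python
import math

def tracker_corner_hz(k1, fs):
    """Equivalent corner frequency of V += k1 * (X - V), i.e. a pole at z = 1 - k1."""
    return -math.log(1.0 - k1) * fs / (2.0 * math.pi)

for shift in (4, 5, 6):                 # k1 = 2**-shift suits shift-based hardware
    k1 = 2.0 ** -shift
    print(shift, round(tracker_corner_hz(k1, fs=25e3), 1))
```

Under this assumption a shift of four bits gives a corner of roughly 257 Hz, and each additional bit of shift approximately halves it, which is one way of moving the pole closer to DC.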
Figure 53: False alarm rates normalized by true positives for data sets with different background activity.
The overall performance shown in Figure 53 reflects how spike detection is systematically accurate until the noise level approaches \(50\%\) of the signal intensity, irrespective of the data set, with the default case where \(k_3=3\). Note that the white noise is additive to the background activity, implying that \(-14\,dB\) of white noise and \(-14\,dB\) of background activity should evaluate to around \(-8\,dB\) accumulated SNR. When the noise level exceeds the anticipated background activity for \(k_3\) we observe a strong increase in the number of false positives. The rate of false negatives presents a more gradual increase, but at this point classification is much more challenging. As expected, background activity has a considerably bigger impact on the false alarm rate because its spectral content and signal structure are equivalent to those of the foreground activity.
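The accumulated noise figure quoted above can be checked with a few lines of Python. The sketch below is only arithmetic and the helper names are ours: it shows that two \(-14\,dB\) contributions give about \(-8\,dB\) when their amplitudes are added linearly, whereas summing them as uncorrelated powers would give about \(-11\,dB\). Which convention applies to the data sets is not stated in the text, so the first reading is an inference from the quoted value.

```python
import math

def combine_amplitudes_db(*levels_db):
    """Combine levels by adding amplitudes linearly (fully correlated case)."""
    return 20.0 * math.log10(sum(10.0 ** (l / 20.0) for l in levels_db))

def combine_powers_db(*levels_db):
    """Combine levels by adding powers (uncorrelated sources)."""
    return 10.0 * math.log10(sum(10.0 ** (l / 10.0) for l in levels_db))

print(round(combine_amplitudes_db(-14, -14), 1))  # -8.0
print(round(combine_powers_db(-14, -14), 1))      # -11.0
```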
We can observe that the main component of computational complexity in this detection operator arises from the FIR & IIR high-pass filters, where the order is closely related to the sampling frequency. In fact, if we ignore the buffer used to capture features before the alignment event then this filter accounts for 70% of the memory utilization and 53% of the elementary operations, while the rest is used for evaluating the instantaneous energy and performing overhead control. Note that the classification operator should be introduced in line 19 on a per-sample basis using the index as a reference pointer. This implies that we will be classifying while repolarization occurs at the electrode and that our detection trigger is blanked out during this interval. As a consequence we lose the capacity to detect any overlapping spikes. Such events are infrequent and missing them can be acceptable because proper classification will likely fail as well.
35 Recursive Variance Decomposition
Another commonly used technique is that of principal component analysis (PCA), which extracts the largest loading vectors \(\nu_n\) of the covariance matrix. This predominantly negates the systematic components of the captured signal and reduces the dimensionality of the spike window to a sub-set of maximally varying features by linear transformation. These components are particularly useful as an indication of spiking activity in the signal due to the structure in \(\nu_n\), but typically also suffice for providing a basis for classification in low noise conditions and for reducing complexity once these vectors are found. The challenge specifically lies with the fact that determining this basis requires both the computation of the covariance matrix, which evolves over time, and finding the transformation that diagonalizes that matrix. The implication is that in order to extract the first two principal components we need to track a total of \(W(W+3)\) state variables, where \(W\) is the number of samples in the spike window.
The iterative method employed here is referred to as recursive variance decomposition (RVD) and is an approximation to standard PCA that recursively tracks the largest two loading vectors, reducing the minimum number of state variables to $3W + 3$. Similarly to PCA estimators like Hebbian eigenfilters 5, every iteration incrementally updates the learned basis without requiring extensive computation. The methodology is based on recursive extraction of the largest loading vector $|\nu_1|$ that is normalized by \(g_1\) by checking the condition $(x - x \cdot \nu)\cdot \nu = 0 $. This condition checks if there is any residue in the direction of \(\nu_1\) after removing its component to see if it is appropriately scaled. Moreover, due to the strong correlation between the mean and the first principal component, we approximate that $sign(\mu) \approx sign(\nu_1)$, completing the extraction of \(\nu_1\). In fact these two statements allow a significant reduction in complexity as normalization is achieved through feedback. The noise shaping and orthogonality properties associated with PCA are preserved using this extraction, which is the most important aspect.
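To make the difference in state requirements concrete, the two expressions quoted above can be evaluated for a few window lengths. The function names are ours and the window lengths are arbitrary examples. For \(W = 20\) the RVD count is 63, which happens to be the figure listed for RVD in Table 7, although the text does not state the window length explicitly.

```python
def pca_state_variables(w):
    """State count quoted for tracking the first two principal components."""
    return w * (w + 3)

def rvd_state_variables(w):
    """State count quoted for recursive variance decomposition."""
    return 3 * w + 3

for w in (16, 20, 32):
    print(w, pca_state_variables(w), rvd_state_variables(w))
# 16 304 51
# 20 460 63
# 32 1120 99
```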
\begin{algorithm}
\DontPrintSemicolon
\KwData{Spike window \(W_1\)}
\KwResult{First two aggregate loading vectors \(\nu_1\) \& \(\nu_2\)}
\Begin{
  \ForEach{Sample n in window \tcp*{ Projection phase } }{
    $D_1[n] = W_1[n] - \mu[n]$ \tcp*{ Get distance from mean spike }
    Accumulate \(p_1\) with $D_1[n] \cdot \nu_1 \cdot sign(\mu[n])$ \tcp*{ Project spike with \(\nu_1\) }
    Accumulate \(p_2\) with $D_1[n] \cdot \nu_2 $ \tcp*{ Project spike with \(\nu_2\) }
  }
  \;
  \ForEach{Sample n in window \tcp*{ Training phase } }{
    Update \(\mu[n]\) with $ k_1 \cdot sign(W_1[n] - \mu[n])$ \tcp*{ Track mean spike }
    Accumulate \(\nu_{1}[n]\) with $k_1 \cdot sign(| D_1[n]\cdot g_1 | - \nu_{1}[n])$ \tcp*{ Move \(\nu_1\) towards \(D_1[n]\) }
    Accumulate \(\nu_{2}[n]\) with $k_1 \cdot sign( (D_1[n] - \nu_{1}[n] \cdot p_1)\cdot g_2 - \nu_{2}[n])$ \;
    \tcp*{ Move \(\nu_2\) towards \(D_1[n]-p_1\cdot\nu_1\) }
    Accumulate \(p_3\) with $(|D_1[n]| - \nu_{1}[n]\cdot p_1) \cdot \nu_{1}[n]$ \tcp*{ Get gain error }
    Accumulate \(p_4\) with $(|D_1[n]| - \nu_{2}[n]\cdot p_2) \cdot \nu_{2}[n]$ \tcp*{ Get gain error }
  }
  Accumulate \(g_1\) with $k_1 \cdot sign(p_3)$ \tcp*{ Adjust gain on \(\nu_1\) }
  Accumulate \(g_2\) with $k_1 \cdot sign(p_4)$ \tcp*{ Adjust gain on \(\nu_2\) }
}
\BlankLine
\caption{Recursive variance decomposition}
\label{aglo:T2_PC_l1min}
\end{algorithm}
Algorithm \ref{aglo:T2_PC_l1min} shows the operation for estimating the first two PCA components. Here \(D_1[n]\) is the new data point offset by the mean spike waveform $\mu [n]$, which allows the long term estimation of aggregate variance. Similarly, parameter \(k_1\) specifies how the state variables are exponentially averaged over the preceding data points. Because the projection of the first loading vector must be evaluated before the second vector, these operations must be sequenced in time or with memory buffers. The evaluation of \(p_4\) is strictly for illustrating the iterative method by which other components are evaluated, while \(g_2\) can also be adjusted to normalize the values of \(p_2\) to prevent overflow without needing \(p_4\).
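The procedure can be sketched in Python as a line-by-line transcription of the pseudocode. This is a floating-point illustration under several assumptions rather than the fixed-point in-channel implementation: both Update and Accumulate are treated as additive increments, the states start from zero with unit gains, and the class name `RVD` and its arguments are hypothetical.

```python
def sign(x):
    """Three-valued sign used by all incremental updates."""
    return (x > 0) - (x < 0)

class RVD:
    """Literal floating-point transcription of the RVD pseudocode (sketch)."""

    def __init__(self, window, k1=0.01):
        self.k1 = k1
        self.mu = [0.0] * window    # tracked mean spike
        self.nu1 = [0.0] * window   # magnitude of the first loading vector
        self.nu2 = [0.0] * window   # second loading vector
        self.g1 = 1.0               # gain normalising nu1 (initial value assumed)
        self.g2 = 1.0               # gain normalising nu2 (initial value assumed)

    def step(self, w1):
        k1 = self.k1
        idx = range(len(w1))
        # Projection phase: distance from the mean and the two projections.
        d1 = [w1[n] - self.mu[n] for n in idx]
        p1 = sum(d1[n] * self.nu1[n] * sign(self.mu[n]) for n in idx)
        p2 = sum(d1[n] * self.nu2[n] for n in idx)
        # Training phase: every state moves by a fixed step k1 per spike.
        p3 = p4 = 0.0
        for n in idx:
            self.mu[n] += k1 * sign(w1[n] - self.mu[n])
            self.nu1[n] += k1 * sign(abs(d1[n] * self.g1) - self.nu1[n])
            residue = d1[n] - self.nu1[n] * p1
            self.nu2[n] += k1 * sign(residue * self.g2 - self.nu2[n])
            p3 += (abs(d1[n]) - self.nu1[n] * p1) * self.nu1[n]
            p4 += (abs(d1[n]) - self.nu2[n] * p2) * self.nu2[n]
        self.g1 += k1 * sign(p3)    # adjust gain on nu1
        self.g2 += k1 * sign(p4)    # adjust gain on nu2
        return p1, p2

# Feeding the same window repeatedly lets the mean estimate settle on it.
rvd = RVD(window=4, k1=0.01)
target = [1.0, -0.5, 0.25, 0.0]
for _ in range(300):
    features = rvd.step(target)
```

Because every update is a fixed step of size \(k_1\) in the direction of a sign, no multiplication by an error term is needed in the training phase, which is the property the text relies on for low complexity.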
36 Template Matching using K-means
Finally we consider the in-channel implementation of template matching. This can be seen as simply a K-means clustering method without dimensionality reduction on the input vector. The implication is that it is characteristically more memory intensive but requires less computationally intensive operators.
\begin{algorithm}
\DontPrintSemicolon
\KwData{Spike window \(W_1\)}
\KwResult{Classification with respect to aggregate clusters}
\Begin{
  Accumulate $Spike \: Count$ with \(1\) \tcp*{track accumulated statistics}
  \ForEach{Sample n in window \tcp*{Projection Phase} } {
    \ForEach{Template k in memory} {
      Accumulate \(p_k\) with $W_1[n] - T_k[n] $ \tcp*{Get \(l_1\) distance for each spike class}
    }
  }
  Find $p_{min}=min{[|p_1|, \: |p_2|, \: |p_3|, \: |p_4|]}$ and Set \(c\) to index \tcp*{Find most similar}
  \ForEach{Sample n in window \tcp*{Training phase} }{
    Accumulate \(T_c[n]\) with $k_1 \cdot sign(W_1[n] - T_c[n])$ \tcp*{ Adjust most similar class}
    \If{ Not all templates generated and $Spike \: Count > k_2$ } {
      Duplicate existing templates \;
      Set $Spike \: Count$ to 0 \;
    }
  }
}
\BlankLine
\caption{Incremental K-Means classification}
\label{algo:T2_Kmean}
\end{algorithm}
The implementation considered in Algorithm \ref{algo:T2_Kmean} is relatively straightforward, where one section handles the generation of new templates and the other adjusts existing templates with new data. The template approach in general has good noise performance due to the redundancy in correlated features that averages out white noise. There is usually some concern with respect to the convergence of k-means centroids, typically due to the fact that noisy sample points may be initialized as new clusters, thereby wasting memory. The method used here iteratively duplicates centroids after convergence. This minimizes the impact of noisy data in the feature space. As illustrated in Figure 54, during each iteration the centroids converge to mean positions. Due to the morphology that these centroids may be in we generally need more centroids than there are clusters, but this approach works well when there are few spike classes. The assumption here is that we are clustering features that are characteristically Gaussian mixtures.
Figure 54: Illustration of centroid evolution over several iterations.
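The incremental template update and the duplication step can be sketched in Python as follows. This is a literal reading of the pseudocode with hypothetical names (`TemplateMatcher`, `max_templates`) and arbitrary parameter values. The distance is accumulated as a signed difference and only its magnitude is compared, exactly as written in the pseudocode; a per-sample absolute value would be needed for a strict \(l_1\) distance. The duplication check is performed once per spike.

```python
def sign(x):
    """Three-valued sign used for the fixed-step template update."""
    return (x > 0) - (x < 0)

class TemplateMatcher:
    """Incremental template matching with centroid duplication (sketch)."""

    def __init__(self, window, k1=0.05, k2=10, max_templates=4):
        self.k1, self.k2 = k1, k2
        self.max_templates = max_templates
        self.templates = [[0.0] * window]   # start from a single zero template
        self.spike_count = 0

    def step(self, w1):
        self.spike_count += 1
        idx = range(len(w1))
        # Projection phase: accumulate the difference to every template.
        p = [sum(w1[n] - t[n] for n in idx) for t in self.templates]
        c = min(range(len(p)), key=lambda k: abs(p[k]))
        # Training phase: nudge only the most similar template.
        t_c = self.templates[c]
        for n in idx:
            t_c[n] += self.k1 * sign(w1[n] - t_c[n])
        # Growth: duplicate all templates once enough spikes have been seen.
        if len(self.templates) < self.max_templates and self.spike_count > self.k2:
            self.templates += [list(t) for t in self.templates]
            self.spike_count = 0
        return c

tm = TemplateMatcher(window=4)
for _ in range(40):
    label = tm.step([1.0, 1.0, 0.5, 0.0])
```

In this sketch freshly duplicated templates are identical, so they only separate once spikes from different classes pull them apart through whatever tie-breaking the hardware applies; the text does not specify that detail.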
37 Complexity Evaluation
Generally the application of these methods should reflect a system level objective. For the configuration used here the memory and algorithmic operations are estimated in Table 7. Multiplications are equivalent to eight elementary operations and the memory calls are not considered as a computation but as load/store cycles. The impression made here is that template matching is very efficient if the memory allows a large allocation of active spike waveforms. Similarly, RVD could show a considerable reduction in operations if a dedicated multiplier is introduced, but that depends on how much we value compactness over execution speed. The disparity in memory requirement will dramatically worsen when the number of centroids is increased, which is not the case for the computational complexity in RVD.
Table 7: Estimation on memory and computational resource requirements for each algorithm.
| Algorithm | Memory | Operations | cycles per sample | 
|---|---|---|---|
| NEO Peak Detection | 20 Elements | 30 | 56 | 
| RVD / training | 63 Elements | 29 / 88 | 57 / 116 | 
| Template / training | 85 Elements | 9 / 16 | 27 / 34 | 
Figure 55: RVD and template based classification for data sets with \(-26 dB\) background activity.
Figure 56: RVD and template based classification for data sets with \(-20 dB\) background activity.
Figure 57: RVD and template based classification for data sets with \(-16 dB\) background activity.
The empirical results in Figure 56 generally show that in moderate noise conditions our classification accuracy is typically better than $85\%$, which is calculated in terms of the aggregate probability of correct classification multiplied with the probability of missing a spike event. Unsurprisingly RVD is not very effective in noisy conditions where the variance accentuates irrelevant components. The classification accuracy from template matching is also shown in Figure 57. These results should primarily show an improved noise rejection characteristic, but more generally this approach is more resilient when dealing with false positives. In principle a new cluster will be assigned to a zero mean template representing the false positives while keeping the other templates intact. Strictly speaking the detection circuit should be readjusted to favour increased detection of false positives as long as the rate of false negatives remains low, but instead exactly the same parameters are used for every test.
We should be careful when judging the effectiveness of these implementations, particularly with respect to efficiency. While we can generally increase performance by allocating more memory or introducing additional computation, we need to quantitatively evaluate the objective. We suggest normalizing the resource allocation with respect to the increased information extracted from the signal by classifying. That is, how much more processing are we allocating for classification relative to the proportional increase in the signal to noise ratio of our output? In the optimistic case when the three classes of neurons being detected are uncorrelated, our base-line accuracy would be \(33\%\) while needing \(56\) cycles of operation for spike detection. In fact this leads us to believe that both algorithms in this respect decrease resource efficiency by a factor of around \(2\), accounting for the increased memory and processing requirements compensated by the increased accuracy. While this claim is very sceptical with respect to the motivation for classifying spike events, it also justifies the aggressive reduction in algorithmic complexity through the approximations presented here. There is genuine benefit in classification that assists the convergence of further processing algorithms. In addition we simply argue that dedicating resources in excess of those needed for signal conditioning may not be worthwhile. The key point demonstrated here is that these methods appear very much attainable in terms of on-chip processing capacity. Here we considered the case without supervision specifically in consideration of scalability. It is likely that further reductions or optimizations can be made to the structure of these methods to improve accuracy and noise tolerance.
38 System Architecture
The conceptual architecture of the system proposed here is foremost based on the opportunity for software defined real-time instrumentation that has not yet been exploited in chronic implantable systems at this scale. Currently it is commonplace to see synthesized logic that performs all processing and data handling procedures in such a way that they have very limited high-level reconfigurability. This is strictly in order to save power and reduce complexity at the system level. It is important to note that for any recording device there are a multitude of phases during its operation where this flexibility can be highly advantageous once sensor characteristics are learned. As discussed in Section 32, many classification algorithms benefit from training or characterizing the recording conditions.
The approach to specialized DSP in the literature reflects two problems in this field. The first is signal extraction from recordings, which relates to what we have discussed in terms of spike detection to extract compressed spike train data. The other is associated with accelerating adaptive filters that map these spike trains onto estimates for cognitive dynamics or invoked limb movement. Typical examples for spike sorting are fully synthesized cores 6 10 that can be integrated and are capable of achieving respectable processing capacity for specific algorithms. This is in contrast to spike train decoding, which is predominantly performed by FPGAs, as integration makes less sense for the development of high dimensional adaptive filters like 11 that do not need to be embedded within the body. Interestingly the work in 12 proposes an application specific instruction set processor (ASIP) that similarly argues for high performance computation for these structures with a high degree of flexibility that reflects the different models used for spike encoding. Additionally we see the advantage of using off-chip microcontrollers like the MSP430 that interface with a highly reconfigurable instrumentation front-end to leverage both adaptive and involved noise shaping to perform more intricate operations such as seizure detection or artefact removal 13. While these works may not be viable for high channel count implantable devices, they do highlight the considerations for designing fully integrated prosthetics that are in line with this work.
Here we will consider a particular type of microcontroller topology that can support reconfigurable functionality and reflects the fact that although multichannel BMIs are highly parallel in nature the associated processing can also be algorithmically intensive. The feasibility of this notion has been estimated to an extent but many components are subject to implementation. In essence we optimistically approach this design problem with a strategy that exploits both the homogeneity in processing and the information locality in order to realize a feasible solution. This lets us focus on the in-channel operation where efficiency is maximized through the topology of the execution unit. Regardless of the end result this proposed system will be one step towards the goal for more effective chronic neural implants.
39 Distributed System
The system illustrated in Figure 58 represents the distributed microcontroller architecture. The primary mechanism of operation is the program memory that continuously feeds the stored instructions into the pipelined array of processors that operate locally on the recorded data. The execution of these instructions is handled by what is essentially an instruction decoder, a memory module and an arithmetic unit that is interfaced with four analogue recording channels. This approach guarantees that the absolute minimum amount of energy is required for the communication of recorded data as the information is processed and consolidated to its elementary component at the quantization interface.
Figure 58: Illustration of the proposed distributed \(\mu\)C array for homogeneous program execution at the sensor interface.
Inherently this implementation will sacrifice the availability of more intricate functionality found in DSPs since the data is not funnelled into one processing unit that can be very elaborate in complexity. The distributed structure is rationalized by the fact that the intensive operations such as clustering methods operate at a much lower speed due to the sporadic spiking activity that makes statistical convergence slow. Furthermore these adaptations need to be performed on the order of minutes, which means such functions may also be implemented through the redundancy of elementary operations. Moreover multiplexing loses effectiveness in memory intensive applications as it does not mitigate the power & area scaling associated with memory allocation.
Also consider that the program control that gives this implementation its capability for generic computation does not scale with the number of processing units. This is an important distinction when addressing hundreds of channels on chip that will allow this implementation to outperform any other architecture and leverage the fully integrated form factor. We also note that whether this architecture is realized by synthesized logic, FPGA fabric, or more custom logic cells is insignificant to the extent that the memory structure plays a more profound role. This claim is based on the algorithms in Section 32 that allocate significantly more resources to memory than to algorithmic operators. In particular, memory density and efficiency are critical to the success of this type of large scale sensor system. Here 3-T eDRAM is employed, which is more effective than alternative memory solutions and can still be realized on a standard CMOS process 14. When compared to an SRAM equivalent we find it can readily achieve a factor 8 improvement in density 15.
Figure 59: System architecture for NPI sensing platform with digital interfaces annotated.
The high-level interfaces are illustrated in Figure 59. There are multiple layers with respect to how internal resources are accessed for reconfiguration. This is primarily for robustness, where each layer increases in complexity and chance of failure. The low-speed interface is the simplest element, which acquires commands from an external device with very relaxed requirements on input timing. These commands allow us to reconfigure the high level sub-blocks, for example tuning the reference voltages generated by the power management, controlling reset/power of individual sub-blocks and selecting which digital test signals should be monitored. In particular the processor array and program memory layers operate almost in isolation from the peripherals. These blocks are timed by the internal PLL structure that drives significantly higher data rates, which do not need to propagate to the pad level in order to save power. The back-end of the system similarly communicates data uni-directionally between two different clock domains to send data packets off-chip using a number of handshaking protocols.
The implementation of the analogue circuitry has been discussed in Section 23, where we additionally constrain all algorithms to a maximum of 1024 cycles per sample while maximally allocating \(128\) words of memory. With respect to our previous discussion this amount of hardware should allow a large set of algorithms that are resource efficient. If not, the topology will promote the construction of processing with more aggressive memory efficiency and the use of feedback dynamics to implement more complex operators such as division. It should be noted that these specifications have flexibility, either by sampling multiple times per program cycle or by reducing the system clock using the configurable phase locked loop in order to reduce power.
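A quick way to reason about these limits is to tabulate the Table 7 estimates against the cycle and memory budget. The Python sketch below does this under assumptions that are ours: that the Table 7 figures apply per channel, that the costs of chained algorithms simply add, and that they scale linearly with the number of channels served by one execution unit. The dictionary keys and the function `fits` are hypothetical names.

```python
CYCLE_BUDGET = 1024   # cycles per sample, from the text
MEMORY_WORDS = 128    # words per execution unit, from the text

# (memory elements, cycles per sample) from Table 7, training-mode cycles.
COSTS = {
    "neo": (20, 56),
    "rvd_training": (63, 116),
    "template_training": (85, 34),
}

def fits(algorithms, channels):
    """Return (memory_ok, cycles_ok, memory, cycles) for a configuration."""
    memory = channels * sum(COSTS[a][0] for a in algorithms)
    cycles = channels * sum(COSTS[a][1] for a in algorithms)
    return memory <= MEMORY_WORDS, cycles <= CYCLE_BUDGET, memory, cycles

print(fits(["neo", "rvd_training"], channels=1))  # (True, True, 83, 172)
print(fits(["neo", "rvd_training"], channels=4))  # (False, True, 332, 688)
print(fits(["neo"], channels=4))                  # (True, True, 80, 224)
```

Under these assumptions the cycle budget is comfortable in every case, while memory is the resource that limits how many channels can run detection and classification together.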

Figure 60: Physical implementation of NPI system using a 6-metal $0.18 \mu m$ CMOS process.
Figure 60 presents the fabricated prototype device. It can be seen that integrating many peripheral blocks such as a phase locked loop, voltage supply regulators, and program memory on chip minimizes the pad count required for the digital and power domains. However, even for a 64 channel system the number of analogue pads required for the sensor interface plays a significant role in the top level organization. In addition careful consideration has to be given to how the digital signals propagate, where minimizing track length not only reduces digital noise coupled to the substrate but more significantly the associated power dissipation. The number of processing elements can in fact quite easily be scaled up by extending the instruction pipeline. The system level timing constraint for speed and fanout then lies with the program memory, which relies on an internal pipeline to tie the program memory together.
40 Processing Core
In order to allow the hardware to provide generic processing capabilities in a distributed fashion a number of considerations have to be made. In particular we need to reflect the typical operations with certain modalities of operation. It is clear that although all recording channels should execute the same algorithm they will typically not share the same state of operation. This state dependency is exemplified by intermittent processing during bursting neural activity and idling during quiet periods. This is an inherent limitation of sharing the program memory, as dynamic execution of the code, where each core has its own program counter or a top level scheduler, is not feasible for an arbitrary number of channels. The quasi-out-of-order execution makes it challenging for us to adopt the scalable tile structures found in image processing 16 that excel in maximizing area and power efficiency in a scalable sense.
We are instead led by maximizing the locality of data execution 17, where branch control or conditional execution is mediated by skipping a section of the incoming instructions if a condition is not met. The approach of skipping sections of code upon branching is relatively inefficient with respect to throughput. However, this approach is optimal at the system level when individual cores may need to execute any section of code, and branching will only be limited by the dissipation related to the registers pipelining the instructions across the chip.
Figure 61: Organization of the distributed execution unit detailing components and the interconnect.
The individual components of the execution unit are shown in Figure 61, which details the main data buses used for exchanging data. The majority of operations revolve around manipulating data in the registers R1-R16 as the A operand in association with any other data source that can be used as the B operand. The operation performed by the arithmetic logic unit (ALU) will always overwrite the result to the location of the A operand but can by extension also be used to write to other locations (i.e. memory, periphery, etc.). This implies that in terms of instructions there are always two components, where the first is simply the operation executed by the ALU together with the two memory sources. The second component optionally extends this simple functionality by writing these intermediate values to multiple other locations or by arbitrary branching operations that will take the unit out of sleep.
On that note we mention that the local execution controller consists of three registers that assist in branching operations or conditional execution. When any of these registers holds logic one the instruction is gated by a null operation before execution. One of these registers will self reset, allowing for if-else functionality by skipping a single instruction. The two other registers need to be cleared actively, but in combination this will allow for nested conditioning of up to three levels. While in the idle state no internal registers are clocked with the exception of the instruction pipeline and the branch controller, saving a significant amount of power as the instruction does not need to be decoded.
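The skip-based conditional execution can be mimicked with a small behavioural model in Python. This is a sketch under assumptions: the instruction tuples, the class `Core` and the predicate mechanism are hypothetical and do not reflect the actual machine code, and we assume that a clear command is always seen by the branch controller even while the core is gated. Every core receives the same instruction stream, and only the branch registers decide which instructions take effect.

```python
class Core:
    """One execution unit with an accumulator and three branch registers."""

    def __init__(self, value):
        self.acc = value
        self.br = [False, False, False]   # BR1 self-resets, BR2/BR3 do not

    def step(self, instr):
        kind = instr[0]
        if kind == "clear":
            # Assumed: the branch controller stays clocked, so a clear is
            # honoured even while the core is skipping instructions.
            self.br[instr[1]] = False
            return
        if any(self.br):
            # The instruction is gated to a null operation.
            self.br[0] = False            # BR1 only ever skips one instruction
            return
        if kind == "skip_unless":
            _, index, predicate = instr
            self.br[index] = not predicate(self.acc)
        elif kind == "op":
            self.acc = instr[1](self.acc)

def run(program, values):
    cores = [Core(v) for v in values]
    for instr in program:                 # the same stream reaches every core
        for core in cores:
            core.step(instr)
    return [core.acc for core in cores]

# if acc > 10: acc -= 10; acc *= 2   followed by acc += 1 on every core
block = [
    ("skip_unless", 1, lambda a: a > 10),
    ("op", lambda a: a - 10),
    ("op", lambda a: a * 2),
    ("clear", 1),
    ("op", lambda a: a + 1),
]
# if acc is even: acc += 100   followed by acc += 1 on every core
single = [
    ("skip_unless", 0, lambda a: a % 2 == 0),
    ("op", lambda a: a + 100),
    ("op", lambda a: a + 1),
]
print(run(block, [15, 3]))    # [11, 4]
print(run(single, [2, 3]))    # [103, 4]
```

The model illustrates the throughput cost mentioned earlier: a core that fails the condition still spends one instruction slot per skipped instruction, but no per-core program counter is needed.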
The digital data interface provides the means for communicating data either off chip or to adjacent execution units. This functionality allows granular consolidation of features or signal structure and the correlation of measurements with system level parameters. For example, if each execution unit is listening to its most informative analogue instrumentation channel, it is conceivable to compare its spike train with that of an adjacent unit to evaluate neural interconnect level features. The asynchronous data bus on the other hand is a key feature that allows this system to appear as a slave at the network level that does not need to be coherent with the system or off-chip clock. This bus is in essence a large buffer distributed across many channels utilising asynchronous hand-shake protocols to funnel the data to an SPI module that is clocked either externally or internally 18. This solves a number of coherence problems and mitigates the need for an FPGA to drive this system as the SPI module is not timing critical. Furthermore this alleviates clock distribution as the timing constraints are always local to each execution unit and not to the data bus that is distributed across the chip, which may otherwise be either very restricting or power intensive.
The dynamic control with respect to the analogue channel is enabled by one designated 8 bit register per analogue channel. In this particular case 4 bits are used to specify gain, 2 bits for configuring the biasing current as 0x, 1x, 2x, 3x, and 1 bit for the reset function. In particular the reset phase will temporarily boost the transconductance on the band-limited filtering stage to allow sub-microsecond auto-zero for active noise shaping. For both the ADC and the amplifiers there is one bit that controls a multiplexer at the input that can switch in the sensor or a global differential test net for calibration or verification. Similarly the ADC has 2 bits to select the analogue channel and another bit to clock it at the full rate or half the rate of the adjoining microcontroller. In addition there are 3 bits to control how the chopper frequency is divided from the sampling signal, which is the final control bit. Understandably the analogue configuration will remain static after being set appropriately. The ADC configuration register is considerably more dynamic as the multiplexer needs to be reconfigured and sampling needs to toggle persistently.
There are two modes of getting quantized data from the ADC depending on the desired functionality. The first is simply reading the 8 bit quantization register that shadows the 7 bits quantized by successive approximation and the LSB from the first integration result. In order to utilize the higher resolution capability the comparator output is used to integrate coefficients from the instruction onto a local register, where the comparator will decrement or increment the register accordingly. If no calibration data is locally stored this operation first integrates binary weights on one register during the SAR cycles and then integrates the FIR window onto another register. This is a large investment of cycles to perform high resolution quantization but this can be optimized for specific applications when it is necessary. If the calibration data is available for the 7 SAR weights then the ADC must be configured to run at half the system speed and before quantization these weights are loaded from the memory onto registers R2-R7. This is followed by the usual process of SAR quantization while these weights are simultaneously also integrated on a second register. After the integration phase three registers will contain quantized data. The scaling of coefficients is key and should be such that the $\Sigma \Delta$ result simply copies the sign bit of the SAR operation and can concatenate the lower 7 bits with the SAR result. Then the calibration data is scaled appropriately and added to the 14 bit signed double with carry logic. Clearly there are a number of conventions suggested here that will best exploit the capabilities of the design.
The memory module local to each execution unit holds 128 words of data, which can be shared across the analogue channels with 32 locations each. Particularly when the DSP is mainly performing filtering, the recorded data can be buffered for FIR filtering or the memory can keep the high precision filter state variables for IIR structures. The filter and program coefficients are stored in the shared program memory such that the execution unit does not experience an overhead in memory requirement. However for other memory intensive algorithms such as template matching, serving the most informative of the four analogue channels will have to suffice because the memory requirement is beyond the capabilities of this configuration. The DRAM architecture has a refresh-upon-read mechanism, which implies that the used memory locations will have to be read systematically to keep the stored data valid. Fortunately this requirement is self fulfilling as the program recycles itself every $100\,\mu s$ and the DRAM retention time is on the order of $1\,ms$, implying that as long as there is a guaranteed read on the memory location it will stay valid. The physical read mechanism however does require a minimum of two cycles. The first is in the background, which simply prepares the internal registers of the module while a different execution is taking place, and the second is in the foreground, where the location is read and the data bus is driven by the DRAM.
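The field widths of the analogue configuration register add up to exactly eight bits (4 gain, 2 bias, 1 reset, 1 input select). A Python sketch of packing and unpacking such a word is given below. Only the field widths come from the text; the bit positions, field names and function names are assumptions chosen for illustration.

```python
def pack_analogue_config(gain, bias, reset, test_input):
    """Pack the four fields into one 8 bit word (bit positions assumed)."""
    if not (0 <= gain < 16 and 0 <= bias < 4):
        raise ValueError("gain is 4 bits wide, bias is 2 bits wide")
    return (gain
            | (bias << 4)                     # biasing current: 0x, 1x, 2x, 3x
            | (int(bool(reset)) << 6)         # auto-zero / reset phase
            | (int(bool(test_input)) << 7))   # sensor or global test net

def unpack_analogue_config(word):
    """Recover the individual fields from an 8 bit configuration word."""
    return {
        "gain": word & 0xF,
        "bias": (word >> 4) & 0x3,
        "reset": bool((word >> 6) & 1),
        "test_input": bool((word >> 7) & 1),
    }

word = pack_analogue_config(gain=5, bias=2, reset=True, test_input=False)
print(hex(word))                     # 0x65
print(unpack_analogue_config(word))
```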

Figure 62: Physical implementation of execution unit using a 6-metal $0.18 \mu m$ CMOS process
As the illustration in Figure 62 shows, keeping the 8 bit structure in terms of parallel operations maintains a very compact floor plan. This is typical of data flow intensive designs where the digital logic should be placed underneath the associated data buses. This is difficult to replicate with automated synthesis tools, where signal congestion is the most stringent aspect. The digital signals for the two operands and the data line span horizontally, and the sub-blocks extensively take advantage of the gated output buffers for each sub-block that are controlled by the decoders. The full custom approach taken here sacrifices design effort for additional performance in terms of reduced parasitics and more aggressive power gating.
$$ \mbox{\textbf{\textless C\textgreater,[\textless CE\textgreater],\textless A\textgreater,\textless B\textgreater,[\textless OE\textgreater*]}} $$
The syntax for constructing instructions needs to be in the Backus Normal Form 19 as formatted in Equation 35 with reference to Table 8, which summarises all possible compositions. A parser is implemented that will translate an ordered set of these instructions directly into hardware specific machine code that needs to be fed into the instruction pipeline; any violations or exceptions will be caught by this script automatically. Although there is no dedicated multiplication hardware, there are specialized registers that allow shift-add based multiplication over eight cycles. Any other primitive logical or arithmetic function can be realized with this instruction set as it is Turing-complete. This assertion is made by noting that it can evaluate the operation "subtract and branch if less than or equal to zero", which is sufficient for a one instruction set computer 20.
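The shift-add multiplication mentioned above can be illustrated with a short Python model in which each loop iteration stands for one cycle. This is a generic sketch of the technique for unsigned 8 bit operands; how the specialized registers hold the partial product, and how signed operands are treated, is not described in the text and is therefore not modelled.

```python
def shift_add_multiply(a, b):
    """Unsigned 8 bit x 8 bit multiply, one multiplier bit per cycle."""
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must fit in 8 bits")
    product = 0
    multiplicand = a
    multiplier = b
    for _cycle in range(8):
        if multiplier & 1:           # add the shifted multiplicand if the bit is set
            product += multiplicand
        multiplicand <<= 1           # weight of the next multiplier bit
        multiplier >>= 1             # expose the next multiplier bit
    return product                   # always fits in 16 bits

print(shift_add_multiply(13, 11))    # 143
print(shift_add_multiply(255, 255))  # 65025
```

Each cycle needs only a conditional add and two shifts, all of which appear as primitive operations in Table 8, which is consistent with the statement that a multiplication is equivalent to eight elementary operations.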
Table 8: Overview of instruction sub-components.
| Index | Operation Subset | Summary of Possible Entries | 
|---|---|---|
| C | Logic Operation: | Logical Shift Left/Right, Arithmetic Shift Left/Right, XOR, XNOR, AND, OR, MOVE-A, MOVE-B | 
| C | Arith. Operation: | Compare, Add, Carry Add, Multiply, Complete Multiply | 
| CE | Compare Option: | \textgreater, =, \textless, Overflow | 
| CE | Add Option: | Subtract, Absolute Value, Increment Overflow Bit | 
| CE | Mov Option: | Mem. Address is from Data line. Default is from Instruction | 
| A | Operand A: | R1-R8, R9-R13, ID, Count, Memory | 
| B | Operand B: | R1-R8, Left uC, Right uC, Instruction, ADC, Memory, Null | 
| OE | IO Extension: | Write to Left uC, Write to Right uC, ADC Sample enable, Write Output Buffer | 
| OE | Branch Extension: | Write to Branch Register BR1-BR3, Invert Branch Result | 
| OE | Memory Extension: | Write Address, Write Data, Read Data | 
It should be mentioned that there are a number of hardware specific details regarding how certain instructions behave that need careful consideration during implementation. For example, if no comparison is made but a branch register is accessed, the output of the comparator will be treated as false regardless of the logic state of the overflow bit. This allows us to clear or set branch registers while simultaneously performing an operation. Another example is that, by default, the instruction data is ready at the input of the memory address to prepare a read in the background. In most cases the behaviour is intuitive and we simply strive to maximize the cycle efficiency. At all times the execution unit is capable of dealing with the compute aspects while performing branching and memory access simultaneously.
This work also provides an elaborate set of test tools that allow compilation of instruction code and the generation of piece-wise-linear ‘csv’ files for test sources that can be used in circuit simulators. These can be used in association with the transistor or Verilog implementation of the processing core. The behavioural models in particular are important for the translation of this architecture to other implementations.
Figure 63: Power dissipation with respect to specific operations for the same operand A=113 & B=114 in randomized order.
The results in Figure 63 exemplify the dependency of power dissipation on the operator for the same operands A and B. It should be expected that there is a strong operand dependency with respect to power consumption, but these results follow our expectations closely. Generally, the simpler the operation the lower the current dissipation, because less logic is involved in the switching losses. Here again we observe that when the unit is in a sleep or branching state the power dissipation is mainly associated with the instruction pipeline. As this 32 bit pipeline traverses the entire execution unit it makes a significant contribution towards the baseline power consumption. The typical current consumption for full activity will lie around $45 \mu A$. It should be noted that sporadic spiking activity will gate the majority of operations, and it is likely that running at half the designed rate with 512 cycles is more than sufficient. Note that the typical energy figure is \(2.7 pJ/Cycle\) or $2.7 \mu W/MIPS$, which is several orders of magnitude better than 16-bit microcontrollers such as the MSP430 21.
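As a consistency check on the quoted energy figure, the arithmetic can be written out using the \(1.2 V\) supply and \(20 MHz\) clock listed in Table 9. The symbols $E_{cycle}$, $I_{\mu C}$, $V_{DD}$ and $f_{clk}$ are introduced here for illustration only, and the conversion to MIPS assumes that one instruction is issued per clock cycle.

$$ E_{cycle} = \frac{I_{\mu C} \cdot V_{DD}}{f_{clk}} = \frac{45 \mu A \cdot 1.2 V}{20 MHz} = \frac{54 \mu W}{20 MHz} = 2.7 pJ $$

Under that assumption, one million instructions per second corresponds to $2.7 pJ \cdot 10^{6} s^{-1} = 2.7 \mu W$, which agrees with the $2.7 \mu W/MIPS$ figure.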
Table 9: Summary of performance specifications for the NPI system and state-of-the-art specialized integrated processing architectures. \(^\star\) Reconfigurable topology.
| Parameter | Unit | This Work | Markovic 22 | Arimoto 16 |
|---|---|---|---|---|
| Architecture | - | Distributed $\mu$C Array | Multi-Grain FPGA | Dedicated Tile Array |
| Technology | [nm] | 180 | 40 | 65 |
| Supply Voltage | [V] | 1.2 | 1 | 1.2 |
| Parallel Units | - | 64 | $16^\star$ | 2048 |
| Instruction Size | [bits] | 32 | - | 32 |
| Operational Frequency | [MHz] | 20 | 400 | 300 |
| Sampling Frequency | [S/s] | 32k | 100M | - |
| Operations per Sample | [Cycles] | 256 | 4 | - |
| $P_{Digital}$ per Channel | [$\mu$A] | 44 | - | - |
| $P_{Analogue}$ per Channel | [$\mu$A] | 16 | - | - |
| System Power | [mA] | 1.42 | 11.6 | 300 |
| Program Memory Capacity | [kb] | 32 | - | - |
| Processor Memory Capacity | [kb] | 1 | 36 | 1 |
| Processor Array Area | [mm$^2$] | $1.04 \times 1.32$ | $3.8 \times 5.4$ | $1.60 \times 3.19$ |
| Power Efficiency | [GOPS/mW] | 1.52 | 0.86 | 0.31 |
| Area Efficiency | [GOPS/mm$^2$] | 0.88 | 2.34 | 36.1 |
The specifications given in Table 9 summarize the main features of this system on chip for processing neural data at the sensor interface. As the total power consumption is on the order of \(1.5 mW\), there is some concern regarding the power density of the system in full operation, which in this particular case is \(26 mW/cm^2\). In fact, if the system is scaled up beyond 64 channels this power density will tend to \(29 mW/cm^2\) but will not exceed it. Either figure will likely be smaller subject to the physical & software implementations, but more importantly will not result in thermal agitation or heating of cortical tissue that exceeds \(2^{\circ}C\) 23. More generally, we have the advantage of tuning processing capabilities to the heat capacity of the implanted package. In fact, comparing this work to state of the art FPGA topologies 22 and highly parallel ASIC structures 16 that follow the same design methodology, we find power and area efficiency that exceeds that of stand-alone microprocessors by orders of magnitude. These figures also reflect the expectation that technology scaling should lead to even more compact configurations, although gate leakage may introduce some diminishing returns with respect to power efficiency. We mention that these figures are extrapolated from the performance of a single execution unit, and we expect more overhead from other components that is not accounted for in this comparison.
$$ R_{D} = \frac{P_{\mu C} \cdot A_{\mu C}}{N^{2}_{chan} \cdot Cycles} = \frac{44 \mu W \cdot 196 \times 158 \mu m^2}{4^2 \cdot 256} \approx 3.3 \cdot 10^{-16} \: \left[ W mm^2 \: per \: OP \right] $$
Re-evaluating our power/area figure of merit in Figure 49 with Equation 36, we observe that in practice we lose a factor of ten in efficiency compared to a dedicated ASIC implementation because resource utilization inside the execution unit cannot be maximized. This was expected given that we attain high-level reconfigurability instead. However, this does provide a very good understanding of how the system scales from this point with respect to both area and power requirements.
41 Testing Platform
This system is directed at generic use within the neuroscience community, where high level programming and interfaces are essential for end user adoption. The testing platform presented here is therefore arranged such that its fundamental components can be extended greatly to serve a multitude of needs. This ambitious design criterion is primarily met by the real-time platform illustrated in Figure 64, which supports a standard Linux operating system. The three components comprise the custom NPI system on chip, the Raspberry Pi platform, and networked resources.
Figure 64: Block diagram of the instrumentation platform developed as framework for real-time applications.
The software stack running on the Raspberry Pi primarily handles the high speed SPI link that fetches data from the NPI system at \(10 Mb/s\) and stores it in a local buffer for some of the data visualization. This data stream is then forwarded to a network routine that connects to a server over the local area network via UDP, allowing large quantities of data to be stored in a scalable fashion. The graphical user interface is built on top of this process to provide a means both to configure the device actively and to interrogate the recorded data and the algorithm being executed interactively.
The application of a generic internet of things platform plays an important role with respect to long term development objectives. It signifies that the ASIC is there to provide a specialized interface with the sensor and a generic digital interface with the external control, allowing rapid adoption of new techniques or other components as software extensions. This substantiates the modular approach where design effort is explicitly focused on specialized hardware for the sensor and on software development at the system level. This is important given the complexity of these systems, where overspecialisation limits the versatility of existing designs and thereby limits the utility of other commercially available tools/devices.
The advantage here is that a multitude of procedures, written in high-level programming code with fast development and turn-around, can be run on the real-time platform without supervision. In this case it significantly improved test procedures by enabling automated exhaustive characterization of logical integrity. In fact the standalone module of the microcontroller structure can run 1 MIPS of randomly generated operations on the fly. This can be seen in Figure 65, where the Saleae logic analyser is used to probe the internal data bus of one particular core.
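The fetch, buffer and forward path described in this paragraph could be sketched as follows. This is an illustration under stated assumptions rather than the platform's actual software: `forward_frames`, `make_udp_socket`, `make_local_buffer`, the frame size and the buffer depth are hypothetical, and `read_frame` is a placeholder for whichever SPI driver call is used on the Raspberry Pi. Only the Python standard library `socket` and `collections.deque` are relied upon.

```python
import socket
from collections import deque

FRAME_BYTES = 256            # hypothetical frame size; not given in the text
LOCAL_BUFFER_FRAMES = 1024   # hypothetical depth of the visualization buffer


def make_udp_socket():
    """Create an unconnected IPv4 UDP socket for the forwarding routine."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def make_local_buffer():
    """Bounded buffer so the visualization copy cannot grow without limit."""
    return deque(maxlen=LOCAL_BUFFER_FRAMES)


def forward_frames(read_frame, sock, server_addr, n_frames, local_buffer):
    """Fetch n_frames from the sensor link, keep a local copy for display,
    and forward each frame to the storage server as one UDP datagram.

    read_frame(n) stands in for the SPI driver and must return n bytes.
    Returns the total number of bytes handed to the socket.
    """
    sent = 0
    for _ in range(n_frames):
        frame = bytes(read_frame(FRAME_BYTES))
        local_buffer.append(frame)      # oldest frames drop out when full
        sent += sock.sendto(frame, server_addr)
    return sent
```

Because `forward_frames` only needs an object with a `sendto` method, the same routine can be exercised against a stub during development and against the socket from `make_udp_socket()` on the real network. UDP does not guarantee delivery, so any loss handling would have to be added on top of this sketch.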

Figure 65: Digital waveform of the internal data bus BIT 1-8 as new instructions are being loaded into the device using the clocked Latch and Configure signals.
Table 10: Section of Instructions and recorded outputs from $\mu C$ structure with the associated machine code.
| BITLINE | INSTRUCTION | Machine Code | 
|---|---|---|
| 00011010 | MOVB R5 DINST 26 | 0011111100000000011000000011010 | 
| 11101111 | MOVA R3 DINST -17 | 0011011000000000011010011101111 | 
| 00001010 | AND R5 R3 | 1110111100000000000000000000000 | 
| 00011010 | MOVB R5 DINST 26 | 0011111100000000011000000011010 | 
| 11101111 | MOVA R3 DINST -17 | 0011011000000000011010011101111 | 
| 11111111 | OR R5 R3 | 1110111100000000000010000000000 | 
| 00011010 | MOVB R5 DINST 26 | 0011111100000000011000000011010 | 
| 11101111 | MOVA R3 DINST -17 | 0011011000000000011010011101111 | 
| 11110101 | XOR R5 R3 | 1110111100000000000100000000000 | 
This is partly shown in Table 10, where the internal bit-line of one such execution unit could be accessed directly. Because it is not viable for us to exhaustively simulate the hardware under various conditions, we use a physical test bench to record the performance tolerance with respect to supply voltage and operating frequency. Moreover, what the user sees is reduced to latent frames of data over several milliseconds and the corresponding instruction code executed by the platform; the physical interfacing protocols are entirely transparent. By construction each core has a hard-wired ID that allows active supervision of internal variables for the development and debugging of single units. Due to the specialized hardware, the instrumentation programs currently still require careful tailoring of the instruction code, but this can be extended towards compiling directly from the C++ code that is also used to construct the rest of the platform.
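The bit-line values recorded in Table 10 can be cross-checked against a simple reference model, and the same model can generate randomized test vectors of the kind used for automated characterization. The sketch below is an assumption-laden illustration and not the authors' test tooling: the names `to_bits`, `expected_bitline`, `random_vectors` and `mismatches` are hypothetical, `device` is a placeholder callable standing in for whatever routine reads back the bit-line, and only the AND, OR and XOR rows of the table are modelled.

```python
import random

WIDTH = 8
MASK = (1 << WIDTH) - 1

# Reference behaviour for the three logic operations listed in Table 10.
LOGIC_OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def to_bits(value):
    """Return the 8-bit two's complement pattern of value as a string."""
    return format(value & MASK, f"0{WIDTH}b")


def expected_bitline(op, a, b):
    """Bit pattern the reference model predicts for op on operands a, b."""
    return to_bits(LOGIC_OPS[op](a, b))


def random_vectors(count, seed=0):
    """Yield (op, a, b, expected) tuples for randomized regression runs."""
    rng = random.Random(seed)
    names = sorted(LOGIC_OPS)
    for _ in range(count):
        op = rng.choice(names)
        a, b = rng.randint(-128, 127), rng.randint(-128, 127)
        yield op, a, b, expected_bitline(op, a, b)


def mismatches(device, vectors):
    """Return the vectors for which device(op, a, b) disagrees with the model."""
    return [v for v in vectors if device(v[0], v[1], v[2]) != v[3]]
```

With the operands from the table, `to_bits(26)` and `to_bits(-17)` give the two MOV rows, and `expected_bitline` applied to those operands gives the AND, OR and XOR rows.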
Figure 66: Graphical user interface used for configuring the NPI system showing test data.
Figure 66 depicts the GPU accelerated graphical set-up used for testing the device, where the functionality is mainly associated with reconfiguring and powering different system sub-blocks for validation. From an engineering point of view it is more of a convenience to have automated reconfiguration of the device as one interacts with the various settings, particularly in association with probing the supply voltages or analogue reference signals generated on chip. During experimentation it would be more typical for this functionality to be reduced to simply selecting from a set of predetermined programs.
Figure 67: Test bed used for characterization with various components illustrated.
In order to move towards fully isolated operation, which will be the case for an implanted device, the system on chip architecture relies on a minimum number of off-chip components in order to bring the resource requirements of the topology into scope. This is shown in Figure 67. These feasibility considerations are generally based on reasonable assumptions associated with a wireless implant that is hermetically sealed. In this particular case we allow a number of off-chip decoupling capacitors, a reference resistor and a reference voltage, which may very well be integrated on chip in one way or another if necessary. The system also uses a \(1 MHz\) external clock reference, which may be realized at the wireless power carrier frequency and is locked onto with a phase locked loop to generate the internal \(20 MHz\) system clock. Three LDO regulators were integrated to provide separate \(1.2 V\) supplies to the digital, analogue, and memory domains. The analogue supply voltage is used to derive the internal ADC voltage references of \(1.2 V, 0.9 V, 0.6 V, 0.3 V\) from the unregulated supply using high speed buffers.
42 Conclusion
This chapter substantiates a scalable and long-term approach for the development of programmable neural interfaces. In particular we discuss why moving away from the fixed purpose DSP architectures seen in many conventional systems is of significance with respect to performance and reliability. In addition we provide indicators showing that, in the majority of modern CMOS technologies, dedicated on-chip processing hardware is viable for performing local signal analysis. Furthermore we highlight the importance of efficient algorithm construction, where operators should revolve around execution per sample and processing structures that improve scalability for systems with many recording channels, in association with the near-data-processing paradigm. PCA & template matching methods are proposed for embedded systems that require 57 operations per sample and 680 bits of memory with entirely unsupervised operation and can achieve over 80% accuracy during spike detection & classification.
A distributed micro-controller structure is proposed in an effort to realize these characteristics and reveal underlying constraints. The topology reflects the nature of processing neural data in the context of achieving generic computational capacity. This discussion details both low-level and system level considerations that address the software stack. The impact of the memory requirement that results from being able to execute arbitrary algorithms in isolation is evident at both the in-channel and chip level. In the proposed configuration the amount of resources allocated for this function is comparable to that of the signal processing but depends very much on the number of channels that are integrated together. We point out that if the number of channels is increased this component does not change, which allows this topology to become more effective. The distributed processing architecture operates with an efficiency of 1.52 GOPS/mW and each core only requires a 0.02 mm\(^2\) silicon footprint with fully reconfigurable 8 bit processing capabilities.
The foregoing discussion has depicted the intricate complexity associated with these sensing systems and revealed the diversity of aspects that should be taken into consideration. Sustainable development for these systems will need long-term solutions due to the excessive design effort that prevents rapid turn around and progress. Moreover innovation needs to be contextualized at the system level to ascertain whether new techniques and methods have significant impact. This requires the abstraction and modelling of these implementations to gauge impact using empirical indicators.
References:
1. M. Braverman, J. Schneider, and C. Rojas, “Space-bounded Church-Turing thesis and computational tractability of closed systems,” Physical Review Letters, vol. 115, August 2015. [Online]: http://link.aps.org/doi/10.1103/PhysRevLett.115.098701
2. M. Verhelst and A. Bahai, “Where analog meets digital: Analog-to-information conversion and beyond,” IEEE Solid-State Circuits Magazine, vol. 7, no. 3, pp. 67–80, September 2015. [Online]: http://dx.doi.org/10.1109/MSSC.2015.2442394
3. H.A. Marblestone, M.B. Zamft, G.Y. Maguire, G.M. Shapiro, R.T. Cybulski, I.J. Glaser, D. Amodei, P.B. Stranges, R. Kalhor, A.D. Dalrymple, D. Seo, E. Alon, M.M. Maharbiz, M.J. Carmena, M.J. Rabaey, S.E. Boyden, M.G. Church, and P.K. Kording, “Physical principles for scalable neural recording,” Frontiers in Computational Neuroscience, vol. 7, no. 137, 2013. [Online]: http://www.frontiersin.org/computational_neuroscience/10.3389/fncom.2013.00137
4. Y. Suo, J. Zhang, T. Xiong, P.S. Chin, R. Etienne-Cummings, and T.D. Tran, “Energy-efficient multi-mode compressed sensing system for implantable neural recordings,” IEEE Transactions on Biomedical Circuits and Systems, vol. 8, no. 5, pp. 648–659, October 2014. [Online]: http://dx.doi.org/10.1109/TBCAS.2014.2359180
5. B. Yu, T. Mak, X. Li, F. Xia, A. Yakovlev, Y. Sun, and C.S. Poon, “Real-time FPGA-based multichannel spike sorting using Hebbian eigenfilters,” IEEE Transactions on Emerging and Selected Topics in Circuits and Systems, vol. 1, no. 4, pp. 502–515, December 2011. [Online]: http://dx.doi.org/10.1109/JETCAS.2012.2183430
6. V. Karkare, S. Gibson, and D. Marković, “A 75-$\mu$W, 16-channel neural spike-sorting processor with unsupervised clustering,” IEEE Journal of Solid-State Circuits, vol. 48, no. 9, pp. 2230–2238, September 2013. [Online]: http://dx.doi.org/10.1109/JSSC.2013.2264616
7. V. Ventura, “Automatic spike sorting using tuning information,” Neural Computation, vol. 21, no. 9, pp. 2466–2501, September 2009. [Online]: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4167425/
8. D.Y. Barsakcioglu, A. Eftekhar, and T.G. Constandinou, “Design optimisation of front-end neural interfaces for spike sorting systems,” in IEEE Proceedings of the International Symposium on Circuits and Systems, May 2013, pp. 2501–2504. [Online]: http://dx.doi.org/10.1109/ISCAS.2013.6572387
9. R.Q. Quiroga, Z. Nadasdy, and Y. Ben-Shaul, “Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering,” Neural Computation, vol. 16, pp. 1661–1687, April 2004. [Online]: http://dx.doi.org/10.1162/089976604774201631
10. A.M. Sodagar, K.D. Wise, and K. Najafi, “A fully integrated mixed-signal neural processor for implantable multichannel cortical recording,” IEEE Transactions on Biomedical Engineering, vol. 54, no. 6, pp. 1075–1088, June 2007. [Online]: http://dx.doi.org/10.1109/TBME.2007.894986
11. Y. Xin, W.X. Li, R.C. Cheung, R.H. Chan, H. Yan, D. Song, and T.W. Berger, “An FPGA based scalable architecture of a stochastic state point process filter (SSPPF) to track the nonlinear dynamics underlying neural spiking,” Microelectronics Journal, vol. 45, no. 6, pp. 690–701, June 2014. [Online]: http://www.sciencedirect.com/science/article/pii/S0026269214000913
12. Y. Xin, W.X.Y. Li, Z. Zhang, R.C.C. Cheung, D. Song, and T.W. Berger, “An application specific instruction set processor (ASIP) for adaptive filters in neural prosthetics,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 5, pp. 1034–1047, September 2015. [Online]: http://dx.doi.org/10.1109/TCBB.2015.2440248
13. C. Qian, J. Shi, J. Parramon, and E. Sánchez-Sinencio, “A low-power configurable neural recording system for epileptic seizure detection,” IEEE Transactions on Biomedical Circuits and Systems, vol. 7, no. 4, pp. 499–512, August 2013. [Online]: http://dx.doi.org/10.1109/TBCAS.2012.2228857
14. K.C. Chun, P. Jain, J.H. Lee, and C.H. Kim, “A 3T gain cell embedded DRAM utilizing preferential boosting for high density and low power on-die caches,” IEEE Journal of Solid-State Circuits, vol. 46, no. 6, pp. 1495–1505, June 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2128150
15. R.E. Matick and S.E. Schuster, “Logic-based eDRAM: Origins and rationale for use,” IBM Journal of Research and Development, vol. 49, no. 1, pp. 145–165, January 2005. [Online]: http://dx.doi.org/10.1147/rd.491.0145
16. T. Kurafuji, M. Haraguchi, M. Nakajima, T. Nishijima, T. Tanizaki, H. Yamasaki, T. Sugimura, Y. Imai, M. Ishizaki, T. Kumaki, K. Murata, K. Yoshida, E. Shimomura, H. Noda, Y. Okuno, S. Kamijo, T. Koide, H.J. Mattausch, and K. Arimoto, “A scalable massively parallel processor for real-time image processing,” IEEE Journal of Solid-State Circuits, vol. 46, no. 10, pp. 2363–2373, October 2011. [Online]: http://dx.doi.org/10.1109/JSSC.2011.2159528
17. R. Nair, “Evolution of memory architecture,” Proceedings of the IEEE, vol. 103, no. 8, pp. 1331–1345, August 2015. [Online]: http://dx.doi.org/10.1109/JPROC.2015.2435018
18. C.E. Molnar and I.W. Jones, “Simple circuits that work for complicated reasons,” in IEEE Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems, 2000, pp. 138–149. [Online]: http://dx.doi.org/10.1109/ASYNC.2000.836995
19. H. Schorr, “Computer-aided digital system design and analysis using a register transfer language,” IEEE Transactions on Electronic Computers, vol. EC-13, no. 6, pp. 730–737, December 1964. [Online]: http://dx.doi.org/10.1109/PGEC.1964.263907
20. D. Wang, A. Rajendiran, S. Ananthanarayanan, H. Patel, M. Tripunitara, and S. Garg, “Reliable computing with ultra-reduced instruction set coprocessors,” IEEE Micro, vol. 34, no. 6, pp. 86–94, November 2014. [Online]: http://dx.doi.org/10.1109/MM.2013.130
21. “MSP430G2x53 mixed signal microcontroller - data sheet,” Texas Instruments Incorporated, Dallas, Texas, pp. 403–413, May 2013. [Online]: http://www.ti.com/lit/ds/symlink/msp430g2553.pdf
22. F.L. Yuan, C.C. Wang, T.H. Yu, and D. Marković, “A multi-granularity FPGA with hierarchical interconnects for efficient and flexible mobile computing,” IEEE Journal of Solid-State Circuits, vol. 50, no. 1, pp. 137–149, January 2015. [Online]: http://dx.doi.org/10.1109/JSSC.2014.2372034
23. P.D. Wolf, Thermal considerations for the design of an implanted cortical brain–machine interface (BMI). CRC Press, Boca Raton, FL, 2008, PMID: 21204402. [Online]: http://www.ncbi.nlm.nih.gov/books/NBK3932