# Welcome to the Brain Whispers blog, where the brain whispers.

## Hidden Markovian Dynamic Connectivity

Last updated: Jan 20, 2017

￼But yet, I say,
If imputation and strong circumstances
Which lead directly to the door of truth
Will give you satisfaction, you may have ’t.

- Othello: Act 3 Scene 3

When you view Mona Lisa, would you be able to tell what is underneath her smile? That is to ask, is there something (e.g. her mood of happiness, or of sadness, etc.) underneath Lady Lisa’s smile that is driving (or “causing”) her to smile? If we assume that her smile changes over time, how does the hidden state change accordingly?

Now, consider a hidden genetic machanism involing an exon, a splice site, and an intron, each of which could generate a nucleotide (A, C, G, and T) of different probabilities (see [1] below). The generated sequence of nucleotides are the building blocks of nucleic acids that forms life (e.g. DNA and RNA) and nucleoside triphosphates (e.g. ATP) that carry energy units.

Foundemental to these is a process called the hidden Markov model where observations such as smiles and nucleotides are generated by a hidden state such as underlying mood and a set of exons, splice sites, and introns.

What related to brain science here, I think, is the the dynamic connectivity. That is, we would like to uncover how dynamic brain connectivity is potentially driven by a sequence of underlying hidden states. This intrigues us to enquire further into the hidden Markov model.

##### 1. Why Hidden Markov?

There are a few attractive points associated with the HHM.

1. It is the hidden states that are Markov; not the observed connectivities;
2. Each hidden state drives (“causes”) a connectivity, over time;
3. We could predict the connectivity at time t+1 based upon the connectivities during previous time points;
4. It allows us to study the parameter $\theta$; but what is $\theta$ here?
5. To estimate $\theta$, we need the EM (expectation maximization) algorithm;
The HMM involves many interesting theories, including probability theory, graph theory, Bayesian theory, and structual equation.
##### 2. Definition (in the context of dynamic connectivity)

Define $S_1, S_2, \cdots, S_T$ as hidden states over T time points (a time point here could be referred to as a time window) $\mathcal{C}_1, \mathcal{C}_2, \cdots, \mathcal{C}_T$ are dynamic connectivity matrices over T time points. For any $i \in \{1, 2, \cdots, T \}$, we have $S_i \in \{1, 2, \cdots, k \}$; namely there are k hidden states.

Referring $\{S_i, \mathcal{C}_i \}_{ 1 \leq i \leq T}$ to a HMM, we imply

[A] $S_i \rightarrow \mathcal{C}_i, \forall i;$

[B] $S_i \sim Markov(n)$, where $Markov(n)$ infers the $n^{\text{th}}$ order Markov.

Above [A] and [B] can be represented in two ways: graphically, and probablistically.

For n = 1, graphically, [A] and [B] can be drawn as follows:

For n = 1, probabilistically, we have $\mathbb{P} \bigg(S_1, S_2, \cdots, S_T; \mathcal{C}_1, \mathcal{C}_2, \cdots, \mathcal{C}_T \bigg) = \bigg[\prod_{t=2} ^{t=T} \mathbb{P} (S_t \vert S_{t-1}) \mathbb{P}(\mathcal{C}_t \vert S_t) \bigg] \mathbb{P} (\mathcal{C}_1 \vert S_1) \mathbb{P} (S_1)$, where three types of probabilities need to be specified

[B1] $\mathbb{P}(S_i);$

[B2] A trainsitioning matrix $T_{K \times K}$ where $T_{ij} = \mathcal{P}(S_l = j \vert S_{m} = i),$ $\forall l, m \in \{1, \cdots, T \}, \forall i, j \in \{1, \cdots, K \}.$

[B3] An emission matrix $\mathcal{E}_{K \times L}$, (assuming $\mathcal{C}_t$ is discret and $\mathcal{C}_t \{1, \cdots, L\}$) where $\mathcal{E}_{kl} = \mathcal{P}(\mathcal{C}_i = l \vert S_i = k),$ $\forall i \in \{1, \cdots, T \}, \forall k \in \{1, \cdots, K \}, \forall l \in \{1, \cdots, L \}.$

##### 3. Some Graph Theory

Consider three nodes A, B, and C, where A goes into (by into there is a sense of direction) B, and B goes into C. Then we say

A is the parent of B (or B is the child of A);
B is the parent of C (or C is the child of B);
C is the child's of child of A (or A is the parent's parent of C).

If A goes into B without goes through another node, then we say the path from A to B is a directed path. If, however, A goes into B via some another node(s), then we say the path from A to B is an undirected path.

Properties:

Any node, once conditioned upon its parents, it is hence independent from its nondescendents. Intuitively, once a node’s parents are given, its property (value) is therefore determinastic (the conditioning blocks the pathways between itself and its parents); hence the node becomes independent from even its ancesters (but still associated with its descendents).

Theory:

Let $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ be three sets of nodes. We say $\mathcal{A} \perp \mathcal{B} \vert \mathcal{C}$ if along every undirected path between $a \in \mathcal{A}$ and $b \in \mathcal{B}$ there exists a node $d$ such that (1) $d$ has converging arrows (arrows coming both into $d$) and neither $d$ nor its descendents are in $\mathcal{C}$;
(2) $d$ does not have converging arrows and $d \in \mathcal{C}.$

##### 4. Dynamic Bayesian Network

Besides the first order and higher order Markov models we briefly discussed above. Critical to hidden Markov models are state space models. To me, this is how the model assumptions are put forward to estimation, through a transition function $f_t$.

Recall $\mathbb{P} \bigg(S_1, S_2, \cdots, S_T; \mathcal{C}_1, \mathcal{C}_2, \cdots, \mathcal{C}_T \bigg) = \bigg[\prod_{t=2} ^{t=T} \mathbb{P} (S_t \vert S_{t-1}) \mathbb{P}(\mathcal{C}_t \vert S_t) \bigg] \mathbb{P} (\mathcal{C}_1 \vert S_1) \mathbb{P} (S_1)$.

With a state space model, we let $S_t = f_t (S_{t-1}) + \epsilon_t$ where $f_t$ is a deterministic, and $E(\epsilon_t) = 0.$ Furthermore, $\mathcal{C}_t = g_t(S_t) + \eta_t,$ where $g_t$ is another deterministic function, and $E(\eta_t) = 0.$

The simplest (yet oftentimes tremendously helpful) state space model is the linear Gaussian state state model. It is linear because we assume $f_t$ is linear, it is Gaussian because we assume the noise is Gaussian.

With these assumptions, we obtain the following succinct structual equation like functions

$S_t = \bold{A} S_{t-1} + \epsilon_t$

$\mathcal{C}_t = \bold{B} S_t + \eta_t$

which gives

$S_t = \bold{A} S_{t-1} + \bold{B} \mathcal{C}_t + \epsilon_t.$

where $\epsilon_t \sim Normal.$

##### 5. Bayesian Learning

Consider a model $\mathcal{M}$. With some a priori knowledge regarding the model (e.g. its probability distribution $\mathbb{P}(\mathcal{M})$; for example, $\mathcal{M}$ could be of linear model with Gussian noise), and some parameters of interests ($\theta_{\mathcal{M}}$ specific to the model and its distribution; e.g. $\bold{A}$ and $\bold{B}$ in the linear model above in section 4, or $\mathbb{P}(\theta \vert \mathcal{M})$ in short).

We consider Bayesian learning in the sense that once we have the prior information, we could use data ($\mathcal{D}$) to update our posterior belief about the parameter of interest ($\theta_{\mathcal{M}}$):

$\mathbb{P} (\theta \vert \mathcal{M}, \mathcal{D}) \propto \mathbb{P} (\mathcal{D} \vert \theta, \mathcal{M}) \mathbb{P} ( \theta \vert \mathcal{M}) }$

Now, we could use the previous dynamic connectivity matirces (e.g. $\mathcal{D} := \{ \mathcal{C}_1, \cdots, \mathcal{C}_t \}$) to predict $\mathcal{C}_{t+1}$ via the predictive distribution

$\mathbb{P} (\mathcal{C}_{t+1} \vert \mathcal{D}) = \int \mathbb{P} (\mathcal{C}_{t+1} \vert \theta, \mathcal{M}, \mathcal{D}) \mathbb{P} (\theta \vert \mathcal{M}, \mathcal{D}) d \theta d \mathcal{M}$
##### 6. EM-algorithm and Statistical Inference

When there are no hidden states, the above parameters of interest $\theta$ in Bayesian learning could be used in ML (maximum likelihood) estimation. The resulting estimates are ones that maximize the utility function (e.g a likelihood function).

While with hidden states (treated as missing data), for example, during dynamic connectivities, we can use the EM (expectation-maximization) algorithm (Dempster, Laird, and Rubin (1977); Wu (1983)). It is briefed, in the context of dynamic connectivity, as follows.

Define the utility function

$\mathcal{L} (\theta) := \log \mathbb{P} (\mathcal{C} \vert \theta) = \log \sum_S \mathbb{P} (\mathcal{C}, S \vert \theta),$

where $S$ denotes the hidden state.

Then, for any random quantity $Q(S)$

$\mathcal{L} (\theta) \geq \mathcal{F}(Q, \theta)$

by the Jensen’s inequality, where

$\mathcal{F}(Q, \theta):= \sum_s Q(S) \log \mathbb{P} (S, \mathcal{C} \vert \theta) - \sum_S Q (S) \log Q(S)$

is the free energy.

E-step:

$Q_{k+1} = \arg\max_Q \mathcal{F} (Q, \theta_{k})$

M-step:

$\theta_{k+1} = \arg\max_\theta \mathcal{F} (Q_{k+1}, \theta)$.

Coffee-step:

Let it run; wanna grab a drink, eh?

Finally, inference could be made based upon estimated paramters of interests $\hat{\theta}$. Certainly, if you are interested in obtaining a confidence interval for $\hat{\theta}$, you could bootstrap the data $\{ \mathcal{C}_i \}_{ 1 \leq i \leq T}$ row-wise.

##### References:

[1] Sean R Eddy (2004) What is a hidden Markov model. Nature Biotechnology;
[2] Zoubin Ghahramani (2001) An Introduction to Hidden Markov Models and Bayesian Networks. International Journal of Pattern Recognition and Artificial Intelligence.
[3] Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B. 39 (1): 1–38.
[4] Wu, C. F. Jeff (1983). On the Convergence Properties of the EM Algorithm. Annals of Statistics. 11 (1): 95–103.

## Trouver la Lumière Dans L'obscurité

Last updated: Jan 3, 2017

￼ The light shines in the darkness, and the darkness has not overcome it.

- John 1:5 (English Standard Version)

I am going to write about stability analysis regarding clustering algorithms (which are commonly used investigating brain data): grouping observations (brain measurements) into different clusters (e.g. observation 1 belongs to Cluster A, observation 2 belongs to Cluster B, etc.) based upon their (intrinsic) characteristics or features. But why do I quote John 1:5, include a photo of Louis XIV, and talk about light and darkness?

Because clustering analysis and the challenge thereof remind us of two things. One is that clustering analysis, or any unsupervised learning approach, is to search in darkness for light, as we do not know the true groups, while, very likely, there are true clusters (light) to which they belong, where the member of one group share more similar intrinsic features than non-members; the other is John 1:5 (see above-quoted; thanks to Professor Semir Zeki’s reminder), that indeed there is true group information underlying, yet we, living in the darkness, have not found a reliable algorithm (candles that are affordable and durable) to uncover the turth, or have unfortunately found one that miscasts the truth. The first descibes the goal of clustering analysis, and the second summarizes one of its difficulties.

##### I. Prologue: La Ville Lumière

Today, the city of Paris is referred to as la Ville Lumière (the City of Light) because of its role during le Siècle des Lumières (lit. the Century of Light, or the Age of Enlightenment). But it literally refers to, according to Wikipedia, the fact that Paris was one of the first European cities to adopt gas street lighting in the 1860s.

However, after some further digging, the street lighting in Paris started with candles, rather than gas, dated back to as early as 1666, as according to the Oct. 29 issue of Gazette (by Charles Robinet), it said that “it is now as bright as night as at high noon.” Large candles, “designed to burn for eight to ten hours” putting in 2,736 lanterns composed of glass panels were used. This was initiated by Gabriel Nicolas de la Reynie and supported by Jean-Baptiste Colbert, minister to the King, and the King, Louis XIV, upon whose death in 1715 began the Age of Enlightenment. The introduction of street lighting turned Paris evening from darkness to brightness, and entailed (partially) the Enlightenment. It reminds me of clustering analysis, where the groups of observations are living in the darkness and are awaiting to be unvealed via fast and effective algorithms (affordable and durable candles).

Hence, I shamelessly adopt the above as my prologue.

##### II. Definitions

$\bold{X} = (X_1, X_2, \ldots, X_n)$, where $X_i, i = 1, 2, \ldots$ can be multidimensional, defines data on which a clustering analysis are to be implemented: For example, $X_i$ can be vectorized connectivity matrix for subject $i$.

$\mathcal{A}_k$ is a (clustering) algorithm that is applied on $\bold{X}$ and that renders $k$ groups (clusters) $\bold{Y} = (Y_1, Y_2, \ldots, Y_n)$, where $Y_i \in \{1, 2, \ldots, k \}, i = 1, 2, \ldots$ Technically, $\mathcal{A}_k$ is a map such that $\mathcal{A}_k: Y = \mathcal{A}_k (\bold{X}).$

$\phi$ is a classifier baded upon the above operation engendering $(\bold{X}, \bold{Y})$. In other words, $\phi (\cdot) = \phi_{\bold{X}, \bold{Y}} (\cdot).$ In the case of simple linear regression, $\phi_{\bold{X}, \bold{Y}} = \hat{\beta} (\bold{X}, \bold{Y}) = (\bold{X}^{\prime}\bold{X})^{-1}\bold{X}^{\prime}\bold{Y}.$

The classifier $\phi$ appllied to a new dataset $\bold{X}^{\prime}$ will render clustering of $\bold{X}^{\prime}$ based upon $(\bold{X}, \bold{Y})$ . The algorithm $\mathcal{A}_k$ appllied to $\bold{X}^{\prime}$ will generate clustering of $\bold{X}^{\prime}$ (namely $\bold{Y}^{\prime} = \mathcal{A}_k (\bold{X}^{\prime})$) based upon $\bold{X}^{\prime}$. $\phi(\bold{X}^{\prime})$ and $\mathcal{A}_k (\bold{X}^{\prime})$ could be different.

Hence, we define the normalized Hamming distance to describe such a discrepancy: $d(\phi(\bold{X}^{\prime}), \mathcal{A}_k (\bold{X}^{\prime})) := \dfrac{1}{n}\sum_{i=1}^{n} 1_{\{ \phi(X_i^{\prime}) \neq \mathcal{A}_k (X_i^{\prime})\}}$.

But, incorrect discrepancy may be introduced as follows: for $k=2$, if one algorithm groups the first part all 1’s and the second part all 2’s, and another algorithm clusters the first part all 2’s and the second all 1’s, the two algorithms generate essentially the same result. To avoid this, define (see J. DeJean (2014))

$d_{\mathcal{G}_k} (\phi(\bold{X}^{\prime}), \mathcal{A}_k (\bold{X}^{\prime})) := \min_{\pi \in {\mathcal{G}_k}} \dfrac{1}{n} \sum_{i=1}^{n} 1_{ \bigg\{ \pi \big(\phi(X_i^{\prime})\big) \neq \mathcal{A}_k (X_i^{\prime}) \bigg \} }$,

where $\pi$ is a permutation operation.

##### III. Clustering Analysis

Clustering analysis, by definition, is to cluster observation (potentially of multiple dimensions) into groups, where the within-cluster members are more similar (in terms of distance, for example) than between-cluster members.

There are many types of clustering algorithms (see Madhulatha (2012) for a review, for example). Note that, in the high-dimensional cases, many algorithms are likely to fail due to the curse of dimensionality, and subspace clustering, correlation clustering and projected clustering are alternatively considered.

Here, we will take the k-means approach as an example. We use it because we like to visualize the steps (which is, in my opinion, helpful to gain a deeper understanding of equations and algorithms) and k-means is less cumbersome for us to delineate them pictorially step by step.

Hartigan (1975) introduced the k-means algorithm in detail. Four years later, the journal Journal of the Royal Statistical Society (Series C) published a number of useful and poweful statistical algorithms and included Hartigan and Wong (1979) as Algorithm 136.

To begin, we note that, in general, the k-means algorithm requires input of an $M \times N$ matrix, where M indicates M subjects and N indicates N variables. For example, in functional connectivity analysis, where, for each subject we have a connectivity matrix. Before data input, we, for instance, vectorize (the upper or lower triangular part of) the connectivity matrix into a vector of length N. Hence, we obtain the suitable input data matrix.

The simple goal is to find a map $\mathcal{A}_k$ so that the resulting partition $Y = \mathcal{A}_k (\bold{X})$, where $Y = \{1, 2, \cdots, K}$ have K values. It is equivalent to say that Y corresponds to K partitions of the input data ($\{G_1, G_2, \cdots, G_k}$), where $X_i$ (data from the $i^{th}$ subject) belongs to one of the $G_j,$’s (for $j = 1, 2, \cdots, K$). We further define $N_j$ as the number of points in $G_j,$ and $d(X_i, G_j$ as the (Euclidean) distance between data point (potentially multidimentional) $X_i$ and center of $G_j,$.

The pictorial presentaton simplifies and visualizes (most) critical steps from Hartigan and Wong (1979), for N = 2 and K = 4.

##### IV. Stability Procedure of Clustering Analysis

Here, we outline the stability analysis procedure proposed by T. Lange et al (2004). See also Silhouette by Rousseeuw (1987).

##### References:

[1] J. DeJean (2014) How Paris Became Paris: The Invention of the Modern City;
[2] T. Lange et al (2004) Stability-based validation of clustering solutions.
[3] P. J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics. 20: 53–65
[4] J. A. Hartigan. Clustering algorithms. 1975.
[5] J. A. Hartigan and M. A. Wong. 1979.

## An ICA by Any Other Name Would Smell as Sweet: Group ICA

Last updated: Nov 2, 2016

￼What's in a name? that which we call a rose
By any other name would smell as sweet.

- Romeo and Juliet: II, ii, 1-2

Calhoun et al (2009). A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data

## On Rejections

Last updated: Oct 26, 2016

￼ Success consists of going from failure to failure without loss of enthusiasm.

- Sir Winston Leonard Spencer-Churchill

Here are two fascinating resouces regarding rejections. Rejection Pursuit is an article written by Xiao-Li MENG, Dean of Art of Science at Harvard, regarding a paper (now published) that had been rejected for a decade; and Reflections on Rejections includes short stories told by now established leaders in art, science, medicine, and law about their rejections - which have now been gathered into a course on rejections. All these remind me of countless rejections that I had received (and surely do foresee, now with a thankful heart, to have more).

Never give in, never give in, never, never, never, never.

• Sir Winston Leonard Spencer-Churchill </i></span>

## Derivation of the Free Energy Principle

Last updated: Oct 14, 2016

￼Now go we in content
To liberty, and not to banishment.

- As You Like It: Act 1, Scene 3

Below is a brief derivation of the Free Energy Principle (which shows that Eq (2) in Friston et al (2006) ) has a minor typo); excuse my ignorance, though, if this is simply my mistake.

Let us begin by recalling the free energy as:

$F(y, \lambda) = - \int q(v; \lambda) \ln \frac{p(y,v)}{q(v;\lambda)} dv$
$= - \int q(v; \lambda) p(y,v) dv + \int q(v; \lambda) q(y,v) dv,$

which gives Eq (1) in Friston et al (2006) ).

Keeping writing, we have

$\bigg\{ - \int q(v; \lambda) \ln p(v|y)p(y) dv + \int q(v; \lambda) \ln q(v;\lambda) dv \bigg \}$
$= \bigg\{ - \int q(v; \lambda) \ln p(v|y) dv - \int q(v; \lambda) \ln p(y) dv \bigg \} + \int q (v;\lambda) \ln q(v;\lambda) dv$
$= \bigg\{ - \int q(v; \lambda) \ln p(v|y) dv - \int q(v; \lambda) \ln p(y) dv \bigg \} + D \bigg( q(v;\lambda) \parallel p(v|y) \bigg) + \int q(v; \lambda) \ln p(v|y) dv$
$= - \int q(v; \lambda) \ln p(y) dv + D \bigg( q(v;\lambda) \parallel p(v|y) \bigg),$

which gives Eq (2) in Friston et al (2006) ), wherein the first term slightly differ.

## Predictive Functional Connectivity

Last updated: Oct 13, 2016

￼An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination.

– Andrew Lang

What is creative is the seeking of perfection - and not attaining it.

- Semir Zeki
##### 1.1 Vector Autoregressive(VAR) Process

Consider $\{ \mathbf{X}_i \in \mathbb{R}^d \}_{i \in \mathbb{Z}}$ as a stationalry process satisfying the VAR model:

$\mathbf{X}_i = \mathbf{A} \mathbf{X}_{i-1} + \mathbf{\epsilon}_i, \forall i \in \mathbb{R},$

where $\{ \mathbf{\epsilon}_i \in \mathbb{R}^d \}_{i \in \mathbb{Z}}$ is a sequence of i.i.d. random vectors.

##### 1.2 $\alpha$-mixing Process

Definition (Richard C. Bradley (2015s)). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathcal{A}, \mathcal{B}$ be two $\sigma$-fields. The $\alpha$ dependency measure is defined as follows.

$\alpha(\mathcal{A}, \mathcal{B}): = \sup _{A \in \mathcal{A}, B \in \mathcal{B}} \left\vert \mathbb{P} (A \cap B) - \mathbb{P}(A) \mathbb{P}(B) \right \vert$

Let $\{ \mathbf{X}_i \} _{i \in \mathbb{Z}}$ be a stochastic process. For $-\infty \leq M \leq N \leq \infty$, define $\mathcal{F}_$, define $\mathcal{F}_M^N := \sigma \{ \mathbf{X}_i: N \leq i \leq M, i \in \mathbb{R} \}$ as a sigma-field generated by $\{\mathbf{X}_i: N \leq i \leq M, i \in \mathbb{R} \}$. For any $i \geq 1$, we define the $\alpha$-mixing coefficients as

$\alpha(i) := \sup_{m \in \mathbb{Z}} \alpha \big( \mathcal{F}_{-\infty}^m, \mathcal{F}_{m+i}^\infty \big).$

The process $\{ \mathbf{X}_i \} _{i \in \mathbb{Z}}$ is $\alpha$-mixing if and only if $\lim_{i \rightarrow \infty} \alpha(i) = 0$.

##### 1.5 Stochastic Matrix

Reference

[1] Hafner, C. M. and Herwartz, H. (2009) Testing for linear vector autoregressive dynamics under multivariate generalized autoregressive heteroskedasticity, Statistica Neerlandica, 63: 294-323
[2] Hamilton, J. (1994), Time Series Analysis, Princeton University Press, Princeton.
[3] Lütkepohl, H. (2006), New Introduction to Multiple Time Series Analysis, Springer, New York.
[4] Andersen, T. G. (2009), Handbook of Financial Time Series. Springer.
[5] Qiu H. (2015), Robust Estimation of Transition Matrices in High Dimensional Heavy-tailed Vector Autoregressive Processes.

## The Crisis of Big Science

Last updated: September 5, 2016

Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.

- Albert Einstein

Disclaimer: I do not know anything about Physics; but the discussion below is interesting and could generate helpful discussion with regards to general Science.

A Discussion over the Proposed Chinese Circular Electron Positron Collider (CEPC) Project

Currently, there is a warm discussion (Chinese original, Google English translation) between Nobel laureate Chen-Ning Yang and Fields medalist Shing-Tung Yau regarding building the Circular Electron Positron Collider (CEPC).

A Discussion over the Cancelled American Superconducting Super Collider (SSC) Project

Here is an article (2012) titled The Crisis of Big Science by Nobel laureate Steven Weinberg regarding a similar, yet cancelled, project, Superconducting Super Collider (SSC) project.

## Multidimensional Scaling

Last updated: September 5, 2016

There is, it seems to us,
At best, only a limited value
In the knowledge derived from experience.
The knowledge imposes a pattern, and falsifies,
For the pattern is new in every moment
And every moment is a new and shocking
Valuation of all we have been.

- T.S. Eliot, East Coker, The Four Quartets

The idea of multidimensional scaling (MDS) is not new. I am revisiting it; and thought it would be fun to summarize it.

MDS in a one sentence

Given a $n \times p$ data matrix with $n \times n$ distance (similarity or disimilarity) matrix, we aim to find a $n \times k$ (where $k$ can be 2, 3, etc.) map matrix that preserves nearly the same distance information as the distance matrix.

MDS in a nutshell

We here focus on the MDS in the classical sense, where the distance on the MDS map is of the same metric as the original distance (similarity) matrix. Alternatively, MDS could be ordinal or non-metric (see references below for more detail).

One role statistics has is to allow data reduction to make convenient and succinct inference and conclusion in face of large quantity of data. An convenient and succinct approach to describe how similar or (dissimilar) two objects are is to define a distance (e.g. Euclidean) between them; when we have more than two objects, we can store these distances in a similar (or dissimilar) matrix. A natural follow-up question is why is an object more similar (or dissimilar) to one object than another? For objects with obvious distance metrics (e.g. cities with certain distance from one another), we can plot the distance on a distance map (e.g. in 2D case we could have the x-axis revealing west-east orientation while the y-axis indicating north-south information). What, however, about human faces, that have few descriptive distances (quantitative features) available? This is when the MDS becomes powerful. It transforms the similar (or dissimilar) matrix into a new feature matrix where each dimension (though possibly indescribable) presents an underlying feature.

Caveats

The term “multi” in MDS indicates that the feature dimension could be more than two or three (though it would be hard to plot when the dimension is greater than 3). The feature could be ambiguous. It allows us to choose how many underlying features we want; and allows us to choose a small numbwer of features that represent the distances well.

Formally, let $X$ of $n \times p$ be the input data with $n$ subject and $p$ variables; then one can compute a similarity (or disimilarity) matrix $\Delta = \{ \delta_{ij} \} {1 \leq i, j \leq n}$; Finally, the MDS matrix of $D = \{ d_{ik} \} {1 \leq i \leq n, 1 \leq k \leq K}$ such that $\parallel d_i - d_j \parallel \approximate \delta_{ij}$, for $1 \leq i, j \leq I$. Practically, we find

$\hat{d}_1, \ldot, \hat{d}_I = \arg\min _{d_1, \ldot, d_I} \bigg ( \parallel d_i - d_j \parallel - \delta_{ij} \bigg)^2$.

Goodness of fit

We use the stress to measure the goodness of fit

$S:= \sqrt{ \frac{\sum_{i,

where, if $S$ is close to 0 indicates values of MDS ($\delta_{ij}$) are fitting well to the original distances ($d_{ij}$).

References:

[0] Theory: Jan de Leeuw and Willem Heiser (1982) Theory of Multidimensional Scaling .
[1] Review: Kruskal, J.B. and M. Wish. 1978. A Review of Multidimensional Scaling (MDS) and its Utility in Various Psychological Domains.
[2] Introduction: Florian Wickelmaier. 2003. An Introduction to MDS.
[3] Application (a): Moore, Rod 1990. “Ethnographic assessment of pain coping perceptions.” Psychosomatic Medicine 52:171-181.
[4] Application (b): Blank & Mattes “Sugar and Spice: Similarities and Sensory Attributes” Nursing Research 39(5):290-293.
[5] Application (c): Wexler & Romney. 1972. “Individual Variations in cognitive structures.” in Romney, Shepard & Nerlove eds. Multidimensional Scaling: Theory and Applications in the behavioral sciences, Vol II. Seminar Press.
[6] Codes: Quick R.

## Imaging Process Techniques in Neuroimaging

Last updated: August 28, 2016

If I paint a wild horse, you might not see the horse; but surely you will see the wildness.

- Pablo Picasso

Neural Painting

Many of you have used software (e.g. Instagram) to adjust or enhance your photoes, and a variety of apps that converts photos into ones that resmeble different art styles such as watercolor and oil, or bear artistical styles such as impressionist paintings); some of you may even heard of neural painting that uses deep learning algorithms to convert any image to a painting that bears a particular painter’s artistic style (see an illustration below), based on (Gatys et al, 2015). There are yet released codes from the paper; we can, however, following the steps on this site (by Andre Infante) to creat such images ourselves; or by sending a source image and style image to the Deep Forger bot to get an transformed image.

From left to right are: source image (Eiffel Tower), image of desired artistic style (Starry Night Over the Rhone (aka the other Starry Night) by van Gogh, 1888, oil on canvas), and the output image. (Image credit: Andre Infante)

Imaging Process

While I do not have much experience in this, I am extremely interested in exploring potentials of it, and have gathered some helpful resource (please refer to the further reading section below) that I would like to share with readers interested in this.

In general, imaging process can be summarized in to the folowing procedures: (1) represent images in image formats (we could think of this as storing image as a cluster of pixels each of which has an intensity); (2) apply image filters (e.g. such as point operations, area operations, and automatic color adjustments) (we could think of this as using a variety of transformations to adjust orientation, contrast, brightness, etc); (3) other cosmatic procedures (such as denoise and “dark frame” subtraction, namely to fix bad pixels), and (4) conduct image analysis (such as object recognition, segmentation, motion detection, and video tracking).

Here, let us consider an example in 2D to delineate a simple case.

First, let us script an image from Esther Honig, a journalist who did a famous study to present how standards of beauty differ across various cultures. This is the image we consider here. Second, we add $n \times m$ $N(5,2)$ noise to pixels (see image in the left below). Finally, we apply wavelet transform to obtain the process image (see image in the right below).

Here is the R code.

Readers interested in the nitty-gritties could refer to resouces below.

[1] An Introduction to Pixel Molding (Øyvind Kolås) . The notes are written in accordance with the Gluas app written in Lua scripting language; codes, however, are provided so they can be easily modified to cope with R, Matlab, etc.
[2]Lectures on Image Processing (Alan Peters, 2016) .
[3] Image Process.
[4] An Introduction to Mathematical Image Processing (Luminita Vese, 2010) .
[5] Image Programming Algorithms (Francis Loch) .
[6] Journal of Image Processing Online .

## On Intuition: A Bayesian View

Last updated: August 18, 2016

When to the sessions of large data thought,
I summon up remembrance of things past,
I sigh the lack of many priors I sought,
And with old prior new experiment my dear posterior's based.

- Adapted from a part of Shakespeare's Sonnet XXX

To
THOMAS BAYES

WE propose that Professors Mikhail Filippov, Varun Prasad and Semir Zeki’s definition of intuition could be mathematically formulated in a Bayesian framework.

An intuition is an unconscious logical brain process with an outcome or conclusion in the form of a statement or proposition. But whether the outcome of the intuitive process is “right” or “wrong”, or “correct” or “incorrect”, can only be determined by a conscious logical process.

First, for simplicity, define an event as $\theta$. Here, $\theta$ is a random variable, a random quantity whose value is subject to variations due to chance. For example, looking into the mirror, you think your height ($\theta$) is 72 inches. However, your visual judegement may not always be correct, or that the mirror may not be perfect. Hence, we could put a probability measure $Pr$ on $\theta$. If you strongly believe in your judegement, then the probability density $Pr(\theta)$ is rather spiky; for example, it covers the height from 71-73. On the other hand, we may have less confidence when a visually-impaired person claims that he finds himself 72 inches tall. In that case, the probability density is rather flat; for example, it covers the height from 61-83. We call $Pr(\theta)$ the prior. Another wonderful example of $\theta$ is the location of where a tennis ball lands in a match ( Körding and Wolpert, 2003).

Next, we define the data observed from experiments as $D$. For example, the measured height. Measured heights have errors, too. Furthermore, the measurement error for a child ($\theta = 50$) is in general smaller than it for a basketball player (e.g. Dwight Howard at $\theta = 83$). Therefore, at (conditioning on) every $\theta$, we define the likelihood measure as $Pr (D \vert \theta)$.

Finally, we define the intuition of an event given experimental data ($D$ ) as $\mathcal{I}(\theta \vert D)$.

By Bayes rule, we have:

$\mathcal{I}(\theta \vert D) = \frac{ Pr(D \vert \theta) Pr (\theta) } {\int_{\Theta} Pr(D \vert \theta) Pr(\theta) }$

It follows, $\mathcal{I}(\theta \vert D) \propto Pr(D \vert \theta) Pr (\theta)$, where $\propto$ stands for proportional to . Let us use the height example to explain this: $\mathcal{I}(\theta = 72 \vert D = d) \propto Pr(D = d \vert \theta = 72) Pr (\theta = 72)$, which indicates that, our intuition about one’s height being 72 inches, given observed data ($D = d$) depends on how well the measurement is, and how strongly our prior information of the height being 72 is. If our previous experience leads us to strongly believe the height is 72 (i.e. $Pr (\theta = 72)$ is large), our intuition that the height is 72 would be strong.

The above equation translates as follow: intuition is based upon the data available and the prior experience information. This conclusion echoes with the statement that Professors Mikhail Filippov, Varun Prasad and Semir Zeki made:

We believe, however, that to have an intuition in any area, one must have experience of that area or knowledge of it, to provide a conclusion or statement, whether correct or incorrect.

Certainly, intuition could change overtime, because our view (prior information) of the world changes. Hence, we could update our Bayesian intuition to one that is “dynamic”, namely:

$\mathcal{I}(\theta (t) \vert D) = \frac{ Pr(D \vert \theta (t) ) Pr (\theta (t)) } {\int_{\Theta } Pr(D \vert \theta (t)) Pr(\theta (t)) }$

This may be studied with Dynamic Bayesian Network (DBN) by Paul Dagum and Recursive Bayesian estimation (Bayes filter). I have little knowledge of this, interested readers could refer to corresponding papers.

Other Interpretations

Biologically, intuition could be explained by early brain signals preceding decisions (see Benjamin Libet, Haggard and Eimer, Soon et all, and Schultze-Kraft).

The thought-provoking post on intuition, and other posts regarding the implication of mathematical beauty on asynchronous brain operations, could be found on Professor Semir Zeki’s blog.

Professor Zeki and I would like to encourage discussion and comments. One question I would like to hear our readers’ opinion is: in the above post, I interpret intuition as the posterior that is based upon data and prior information (the latter of which is based upon previous training or experience). However, one could argue that intuition is indeed the prior; and with data, we update and form our belief (the posterior).

Further interested readers on these topics could refer to Professor Semir Zeki’s books: A Vision of the Brain, Inner Vision: an exploration of art and the brain, Splendors and Miseries of the Brain, and Balthus ou la quête de l’essentiel, and La bella e la bestia: arte e neuroscienze.

## Time Series Analysis in Neuroscience

Last updated: August 10, 2016

If you can look into the seeds of time
And say which grain will grow and which will not,
Speak, then, to me, who neither beg nor fear

- Macbeth act 1, sc. 3

I have spent the past two weeks on the road and air, with intermittent internet access. While having not had the opportunity to keep up what is happening around the world, I had the fortune to re-read Professor Richard Davis’s wonderful book on time series and forecasting on the go. While the book is quite long (400+ pages), I have started to summarize and organize a few (subjective) key elements from the book for future reference. If you are starting to learn time series, I personally think the note might give you a succinct summary of the models used. You can find them here. (I will update it regularly). Certainly, the more times I read a book, the more unknowns I found. If you have any recommendation or critism regarding the note, or want the latex file so you can edit it yourself, please let me know.

The reason I started to re-read the book was (besides having no internet on the go) that time series analysis is very useful in studying brain data on the time and spectral domain. Here is an overview by Professor Hernando Ombao; and Professor Tohru Ozaki has a book called Time series modeling of neuroscience data.

My friend Huitong has written two very beautiful papers (see [6] and [7] below) on high dimensional heavy-tailed time series (which I have little knowledge of, but we had some discussion upon this topic and the potential of extending them in studying “causality” between different brain regions). I have the papers and some interesting references listed below.

[1] Granger, C. W. J. (1969), Investigating causal relations by econometric models and cross-spectral methods, Econometrica, 37: 424-438.

[2] Hafner, C. M. and Herwartz, H. (2009) Testing for linear vector autoregressive dynamics under multivariate generalized autoregressive heteroskedasticity, Statistica Neerlandica, 63: 294-323

[3] Hamilton, J. (1994), Time Series Analysis, Princeton University Press, Princeton.

[4] Lütkepohl, H. (2006), New Introduction to Multiple Time Series Analysis, Springer, New York.

[5] Andersen, T. G. (2009), Handbook of Financial Time Series. Springer.

[6] Qiu H. (2014), Robust Portfolio Optimization under High Dimensional Heavy-tailed Time Series;

[7] Qiu H. (2015), Robust Estimation of High Dimensional Heavy-tailed Vector Autoregressive Processes.

## From Network Strength to Effective Network Strength

Last updated: July 28, 2016

You may say I'm a significant voxel, but I'm not the only one. I hope someday you'll join us. And the brain network will live as one.

- Adapted from John Lennon's Imagine

In this post, we shall propose a framework called the effective network strength, extending the definition of network strength introduced in the wonderful Nature Neuroscience paper by Rosenberg et al (2016). As a natural follow-up, we also define the effective network strength rate.

1. Network strength

Rosenberg et al (2016) introduced a neuromarker based on intrinsic whole-brain functional connectivity, the degree to which brain activity in distinct neural regions is correlated over time, and showed that functional brain networks strength during a sustained attention task predicted individual differences in performance.

To characterize participant’s degree of connectivity, Rosenberg et al (2016) defined network strength, a single summary statistic, which is equivalent to a weighted degree measure for each network. Formally, let us define $S_k$, as the network strength for subject $k$. Then

$S_k = \sum_{i,j} \rho_{ij}^{(k)} I_{ijk},$

where $\rho_{ij}^{(k)}$ is the $\{i,j\}^{\text{th}}$ entry of the connectivity matrix (here the connectivity matrix is computed by taking the node-wise correlation between node-specific time series); and $I_{ijk} = 1$ if $\rho_{ij}^{(k)}$ follows a specific thresholding rule (e.g. $I_{ijk} = 1$ if $\rho_{ij}^{(k)} \geq 0.8$), and 0 otherwise.

2. Effective network strength

However, a hyperactive subject’s brain could be extremely active during both the resting state and the task state; hence the large correlaton coefficients between nodes that are extremely active during both resting and task states could overshadow the functional changes of other edges. For example, if for a subject there are 100 nodes that are highly correlated during both resting and task states with $\rho = 0.99$ each (or without significant changes, say from $\rho = 0.98$ to $\rho = 0.99$), the change of other edges from 0.001 to 0.1 (despite statistically significant and plausiblly neurobiologically meaningful) would likely to be ignored if we are interested in the node-wise (thresholding) sum (as in Rosenberg et al (2016)). Therefore, we propose the definition of effective network strength as follows:

$S_k^{eff} = E(S_k \vert X_k = 1) - E (S_k \vert X_k=0),$

where $X_k = 1$ refer to the task state, and $S_k = 0$ refer to the resting state.

Effective network strength describes the average change in network strength due to a change in experimental status. It is experimental dependent; and for numerical tasks, it could be written as $X_k = x$, for some task level $x$.

As you may have realised, the idea of effective network strength was inspired by effective connectivity (Friston (2011) and Aertsen and Preißl (1991)). Nonetheless, effective connectivity regards the effect one neuron has on another and is time dependet, whereas effective network strength is time-independent as when taking the connectivity matrix, the time is averaged out.

3. Network strength with multiple inputs

What if we have more than one tasks? And what if are interested in effective network strength between male and female, or between different age groups? To address these interesting questions, we define the conditional network strength on different inputs.

$S (\boldsymbol{x}) = S (X_1 = x_1, X_2 = x_2, \cdots, X_n = x_n)$

where $X_i$’s define tasks, and $x_i$ is its realization.

For example, $S(\text{treatment} = 1, \text{gender} = 1)$ defines network strength under a task for male (gender = 1); and $\{S_{gender}^{eff} \vert \text{treatment} = 1 \} = E(S \vert \text{gender} = 1, \text{treatment} = 1) - E (S \vert \text{gender} = 0, \text{treatment} = 1),$ defines the gender effect on effective network strength conditioning on treatment.

4. Effective network strength rate

Once we have defined the effective network strength, it is natural to define the effective network strength rate: the change (note: could be decrease) of network strength per unit treatment increase.

$\nabla = \frac{\Delta S}{\Delta \boldsymbol{x}} = \frac{S(\boldsymbol{x} + \Delta \boldsymbol{x}) - S (\boldsymbol{x})}{\Delta \boldsymbol{x}}$

Certainly, when the task is multidimensional $\boldsymbol{X} = \boldsymbol{x} = (x_1, \cdots, x_n)$, the effective network strength rate could be specified at every task direction, namely $\nabla = \bigg( \frac{\partial}{\partial x_1}, \cdots, \frac{\partial}{\partial x_n} \bigg)$.

5. Network Strength in a Boltzmann Machine Global Energy Sense

The energy of the mind is the essence of life.

• Aristotle </i></span>

The global energy in a Boltzmann machine, $E$, is defined as follows:

$E = - \bigg ( \sum_{ij} \rho_{ij} s_i s_j + \sum_i \epsilon_i s_i \bigg )$

where $\rho_{ij}$ is the connection strength between unit $i$ and unit $j$ (comparing with node-wise correlation in network strength), $s_i \in \{0, 1\}$ is the state for unit $i$ (comparing $s_i s_j$ with $I_{ij}$ in network strength), and $\epsilon_i$ is the bias of unit $i$ in the global energy (this is not present in network strength - yet it could because data driven node-wise correlation is likely to have bias).

Due to the properties of Boltzmann energy function and its similarity with network strength function, at resting state (and possibly at task state), when the node-wise correlation could be updated, it will be interesting to show that the network strength function (1) is a Lyapunov function, and (2) has Markov property. (1) indicates that the network strength will eventually converge to a state which is a local minimum; and (2) tells us that we could impose probability measure on network strength.

## On Brain Fingerprinting: Extensions

Last updated: July 20, 2016

￼The unique fingerprint of every individual defines our unique purpose and mission on earth.

In this post, we shall discuss the wonderful brain fingerprinting paper published on Nature Neuroscience by Finn et al, and present a few potential extensions.

1. Recap

First, let us briefly describe the dataset and the fingerprinting procedures in Finn et al (2015). The study has 126 subjects, each subject has 268 nodes, each subject is measured at $T$ time points (rigorously we should write $T_i, \forall 1 \leq i \leq 6$ because the time points vary between sessions), and there are six image sessions: two resting sessions (rest 1 [on day 1], rest 2 [on day 2]) and four task sessions: working memory, emotion, motor and language (two on each day). Therefore, the dataset is

$126 \times \underbrace{268 \times T}_{\text{compute} \hspace{2mm} 268 \times 268 \hspace{2mm} \text{node-wise correlation matrix}} \times 6.$

For every subject at each image session, we could convert the $268 \times T$ matrix into a $268 \times 268$ node-wise correlation matrix. Therefore, at each session, we have 126 subject-specific node-wise correlation matrix. Next, we vectorize each node-wise correlation matrix. Finally, consider subject $i$ from one image session (say Rest 1). To find a matching subject from another image session (say Rest 2), we take $\bold{v}_i ^{(1)}$, the vectorized node-wise correlation matrix for subject $i$, where the superscript indicates it is from the first image session, to each of vector in the Rest 2. The subject identified is the one who achieves

$\rho_{i j*}^{(1,2)} = \arg\max_{1^* \leq k^* \leq 126^*} \rho_{i k^*}^{(1,2)}.$

where $\rho_{i j*}^{(1,2)} = corr(\bold{v}_i^{(1)}, \bold{v}_{j^*}^{(2)}).$

Notice, the $*$ indicates that subject $15^*$ in Rest 2 may differ from subject 15 in Rest 1.

2. Extensions

2. 1. Brain topology v.s. brain function on identication accuracy

The paper claims that identification power came from true differences in functional connectivity rather than anatomic idiosyncrasies. To support the claim, the authors used kernel smoothing on connectivity matrices, and showed that, with larger smoothing kernels (under which the registration advantages for the same brain compared to different brains should be vastly reduced or eliminated), only a slight drop in identification power. One may argue that smoothing does not eliminate differences caused by topology fully. Let us take an extreme example: consider two subjects (S1 and S2) with significant anatomical difference in 100 adjacent nodes, where the brain measurements of S1 for these nodes during two sessions are nonzeros, and those of S2 for these nodes during two sessions are all zeros. In this case, for those 100 nodes, a large smoothing kernel (unless an infinite bandwidth is chosen), will give S1 small but nonzero smoothed brain measurements whereas S2 has all zeros. Therefore, S1 in one image session would still match S1 in another session, and so is S2 - this could potentially argue against the functional connectivity claim.

To better eliminate anatomic idiosyncrasies, we need brains that are particularly similar anatomically. One could start by looking at the HCP twin data, concerning that the brain topology between identical twins is almost the same. Using the HCP twin and non-twin data, we could then ask three questions: is brain fingerprinting (1) due to brain functions alone, (2) due to brain topology, or (3) due to a combination of (1) and (2). Let us call the identication accuracy rate for twins as $T$ and it for non-twins as $NT$, both of which are in percentage. If $T \sim NT$, then we conclude that brain fingerprinting is not due to topoloty; if $0 << T < NT$ (i.e. we still achieve some accuracy using twin data, but not as good as if we use non-twin data) then we conclude that brain fingerprinting is due to a combination of brain function and topology; and if $T \rightarrow 0, NT >> 0$ (i.e. we almost cannot fingerprint twins), then we conclude that brain fingerprinting is due merely to brain topology. It would be interesting to see how much identification accuracy is due to functional connectivity and how much due to anatomic idiosyncrasies, via proposing a Functional/Anatomic Ratio Score, by studying the twin and non-twin data.

2. 2. “Dynamic” brain fingerprinting

The study showed that “longer time courses better preserved individual characteristics in connectivity profiles”. However, it would be interesting in examining if the identification accuracy rate changes across the time of the day. This question is driven by studies we did on physical activity data analysis, where our daily physical activity on average follows a “M” shape. It will be tremendously interesting if the identification accuracy rate across time somewhat matches the physical activity pattern (especially during task sessions) - which would indicate that stronger physical activity could possibly facilitate brain fingerprinting. See below.

Here, we present the brief procedure. First, divide the total time points in a time $T$ into $N$ periods, each with $t$ time points, i.e. $T = N \times t$. For every $t$ time points, we repeat the fingerprinting procedure discussed in the paper, and obtain $N$ identification accuracy rates, and call them $R_1, \cdots, R_N$. Finally, we plot $R_1, \cdots, R_N$ against $N$ time points of the day.

2. 3. Age, gender, etc. effects on identification accuracy rate
It would be interesting to investigate the age effect on identification accuracy rate. If there is a obvious difference identification accuracy rate between age groups, one may argue that this is due to functional difference related to age. Potentially, there could be six types of age effect trends (see below). Based on the trend, the age effect on identification accuracy rate would gain us neurobiological understanding of the brain functional similarity due to age: for example, the top left image below would indicate that the brain functions are the most stable (similar between two sessions) during one’s adulthood while not so much during childhood (possibly in that kids’ brain signals are super active one day and less so the second day or their state of mind has more variations - possibly not yet functionally matured), nor during ederly population (possibly due to lesions); the image to its right would indicate that age seems to not affect measured brain functions significantly.

Similarly, one may look at the diffence of identification accuracy rate between male and female.

2. 4. Neurodegenerative Disease Similarities due to Brain Functions
One may be interested in seeing association between differentn neurodegenerative diseases due to functional similarities. For example, one may ask, amongst patients who develope Parkinson’s (PD) , Alzheimer’s (AD), and Huntington’s (HD) diseases, are PD patients brain functions more similar to AD’s, than to HD’s?

Let us take an example to describe the procedure. Consider two task-based imaging sessions: working memory and motor (we choose these two because AD and PD patients’ brain functions may be correlated with respect to these two tasks). Now for subjects under the working memory task group, we could group them according their disease status. For those in the motor group, we could randomize the order of subjects (because our goal is to see how well we match patients in different disease groups). Next, we vectorize each correlation matrix; and calculate correlations between these vectors in two task groups. The results are stored in a disease-specific (dis)similarity matrix due to brain functions.

Certaily, the same subjects would likely be accurately matched; nonetheless, it would be interesting to see if the subjects in the same disease group are matched. The grouping is present in a disease-specific (dis)similarity matrix.

2. 5. Brain-Genome Fingerprinting Whilst genomic information may assist identify subjects alone (I have little knowledge about whether there are exisiting works on this), to match subjects using brain functions along with genomic information during two sessions gains us little because genes for the same subject do not vary much during these sessions; adding brain measurements into gene information would even likely to reduce identification accuracy rate. Nevertheless, it would be interesting to see if we could match one’s brain functions with gene sequence, and vice versa. However, one may argue, if the matching is accurate, we are actually matching gene sequence with brain structure, rather than with brain function. I am not able to adress this, even with the twin data in (2.1), because identical twins are likely to have similar brain structures, and have the same DNA (assuming SNPs are ignorable).

2. 6. Fingerprinting Defining Matrix

Consider $Y_1$ and $Y_2$, each of which is $n \times v$, where $n$ is the number of subject and $v$ is the number of vectorized nodes, as data from two sessions. In particular, for example, the first row of $Y_1$ is the node-wise correlation matrix for subject 1 in session 1 vectorized.

We then define a $n \times n$ (where $n$ is the number of subjects) Fingerprinting Defining Matrix, where if (consider row-wise) the $(i,j)^{\text{th}}$ entry is the largest among all entries in row $j$, we say subject $i$ in session 1 matches subject $j$ in session 2.

The Fingerprinting Defining Matrix $M$ of dimension $n \times n$ satisfies:

$M = \arg\min_{ X } \{ \parallel X \parallel _{\mathcal{F}}: Y_1^T X Y_2 = V_1^2 V_2^2 \},$

where $V_1$ and $V_2$ are scaling matrices of $Y_1$ and $Y_2$, respectively.

In other words, there are a lot of matching matrices that allows $Y_1^T X Y_2 = V_1^2 V_2^2$ (i.e. obtain the “product variance” of data from both sessions); but $M$ is the matching matrix with the smallest matrix norm to achieve so.

Indeed, the correlation matrix $P = Y_1^T Y_2$ achieves that $Y_1^T P Y_2 = V_1^2 V_2^2$ because $Y_1^T P Y_2 = D_1^T V_1 S_1^T (S_1 V_1 D_1 D_2^T V_2 S_2^T) S_2 V_2 D_2 = V_1^2 V_2^2$. However, I do not have a proof that the correlation matrix is the matching matrix that has the smallest matrix norm to achieve $Y_1^T X Y_2 = V_1^2 V_2^2$. It would be interesting if someone could prove this (or disprove this). If the former, then we will have a mathematical interpretation for the correlation matrix to be a fingerprinting matching matrix that matches the data from two sessions to abtain the “product variance” from the two data sets.

2. 7. Connecting Brain Fingerprinting with Encoding models, RSA, and Decoding Models

The connectivity matrix approach used in Finn et al (2015) is in essence identical to it used in RSA (Kriegeskorte et al, 2008). A few extensions could be made linking these approaches (I will write another post on this on a later date). Here, I would like to discuss the behavioral prediction discussed in the paper and potential extensions.

The paper uses feature network strength to predict gFscore (fluid intelligence) score. For subject $k$

$gF_k (X_k) = \beta_0 + \beta_1 X_k + \epsilon_k,$

where $X_k = \sum_{i,j} \stackrel{(k)}{\rho^{(+)}_{ij}}$ is the positive feature network strength for subject $k$, which sums up all significant (by thresholding) postive correlation $\rho_{ij}$’s. The model for negative feature network strength could be Similarly defined.

The prediction is well done. However, since the correlation between predicted and observed gF scores was $r = 0.50 (p< 10^{-9})$ for the positive-feature model and $r = 0.26 (P = 0.005)$, it would be interesting to see if there are things we could improve upon.

First, we could check if a subregion of connectivity matrix (e.g. only take the part of the connectivity matrix whose nodes correspond to Broca’s and Wernicke’s areas) under specific tasks (e.g. language) would predict gF better. Second, since the authors used simple linear regression to train the data, we could investigate if there is a model that better fits the training data (and hence have a better leave-one-out prediction). For example, a non-linear model

$gF_k(X_k) = f(X_k) + \epsilon_k,$

where $f$ is some non-linear mapping. Or a semiparametric model

$gF_k(X_k) = f(X_k, \theta, \gamma) + \epsilon_k,$

where $\theta$ is some finite-dimensional parameter of neurobiological interest, and $\gamma$ is a nuisance parameter (such as the BOLD signal variance in each node).

3. Conclusion

In summary, Finn et al (2015) is a tremendously interesting paper that demonstrated existing subject-specifi functions in the brain. I eagerly look forward to more exciting follow-up works from the group.

## On Task-based Age and Gender Effects on Spatial Brain Representations: with Applications to the Human Connectome Project

Last updated: July 16, 2016

￼Biology gives you a brain. Life turns it into a mind.

– Jeffrey Eugenides

I had the idea of this post after an inspirational conversation with Professor Ruben Gur, who, together with his wife, Professor Raquel Gur, are world-renowned experts in neuropsychology, and have done a tremendous amount of work in gender and age differences in the brain.

Here, I would like to propose a functional mixed effect model to investigate how gender and age effects on encoded (representational) spatial brain signals under (i.e. conditioning on) different tasks. This framework can be implemented to study the 500-subject datasets from the Human Connectome Project, with a goal to understand age and gender effects on the following main categories: alertness, cognition, emotion, motor function, personality, sensory, substance use, etc. Each main category could be further divided into subcategories: e.g. one may be interested in studying age and gender effects on emotion, in particular, psychological well-being.

We consider a marginal model for the brain activity data. Specifically, for subject $i$ at visit $j$, consider:

$y_{ij}(\nu_k) \vert x_{ij}, z_i = \alpha(\nu_k, x_{ij}) + z_i \beta (\nu_k, x_{ij}) + \epsilon_{ij} (\nu_k),$

where $\nu_k \in \mathcal{V}$ indicates the $k^{\text{th}}$ voxel (for simplicty, we could vecterize voxels in a 3-D brain to a long vector), $y_{ij}(\nu_k)$ is the voxle-specific intensity, $x_{ij}$ is the age, $z_i$ is the gender, and the error follows a working model

$\epsilon_{ij} (\nu_k) = b_i (\nu_k) + w_{ij} (\nu_k) + \epsilon_{ijk},$

where $b_i (\nu_k)$ models the subject-specific mean function of brain activity, $w_{ij} (\nu_k)$ models the deviation of the $j^{\text{th}}$ visit from the subject-specific mean function, and $\epsilon_{ijk} \sim N(0, \sigma_\epsilon^2).$ Furthermore, we model $b_i (\nu)$ and $w_{ij} (\nu)$ by two zero mean Gaussian processes with spatial covariance functions $\Sigma^b(\nu,v)$ and $\Sigma^w(\nu,v)$, respectively. Finally, we assume that $b_i (\nu)$, $w_{ij} (\nu)$, and $\epsilon_{ijk}$, are mutually independent.

To interpret the model, here $\alpha(\nu, x)$ is the brain activity mean surface, specifically, $\alpha(\nu_k, x_{ij})$ gives the mean brain activity for voxel $\nu_k$ and age $x_{ij}$; $\beta(\nu, x)$ is the brain activity slope surface, specifically, $\beta(\nu_k, x_{ij})$ indicates the brain activity difference due to gender for voxel $\nu_k$ and age $x_{ij}$.

It would be relatively straightforward to implement computational wise if one’s research focus is on ROIs (in other words, the number of voxels is reasonably small). When the number of voxels is large, we may run into computational problems, because we are essentially dealing with a huge surface of $N \times p$, where $N$ is the number of ages and $p$ is the number of voxels. We may then consider dimension reduction or assume that the brain network is sparse. We shall not expatiate on this here; interested readers could consider approaches such as SVD, fPCA, ridge regression, etc.

Readers who are interested in the estimating procedures of the above model could discuss with me via email. If one is interested in learning more about functional data analysis, please refer to the following wonderful resources: Ramsay (2006) proivdes a broad overview of functional data analysis methods with applications to curve and image analysis; and Ruppert et al. (2003) overviews functional data analysis in the semiparametric framwork in detail.

This framework can be implemented to study age, gender, and age-gender effects on sensory data (audition, olfaction, pain, taste, vision), on substance use (alcohol, tobacco, and drug), psychiatric and life function, motor (endurance, locomotion, dexterity, and strength), emotion (emotion recognition, negative affect, psychological well-being, social relationships, stress and self efficacy), cognition (episodic memory, executive function/cognitive flexibility, executive function/inhibition, fluid intelligence, language/reading decoding, language/vocabulary comprehension, processing speed, self-regulation/impulsivity, spatial orientation, sustained attention, verbal episodic memory, working Memory, and alertness (cognitive status, and sleep). Each area could be developed into a journal article using the above framework.

## On Generating Artificial Functional Magnetic Resonance Imaging (fMRI) Data (II): Dynamic 3D Visualization

Last updated: June 7, 2016

￼The human brain starts working the moment you are born and never stops until you stand up to speak in public.

– George Jessel

Once we understand the data generation mechanism of fMRI data, and have become familiar with the parameters needed, we can proceed to simulate desired fMRI data. There are many wonderful articles providing examples and codes for simulating fMRI data (see, for example, Welvaert et al). For a list of software for medical image analysis, see here.

Here I shall focus on visualization of simulated fMRI data, so as to examine the quality of the realizatio of the simulation, such as, if we are to simulate fMRI data for a regional brain lesion, do the simulated fMRI datasets accord with real world scenarios? Additionaly, I am interested in seeing the intensity of signal changes across time. This has real world implications. For example, visualizing the brain dynamic functional connectivity; visualizing the growth of tumors; and etc. Let us consider a concrete example. Consider a patient with two brain regions associated with abnormal activities (e.g. the patient has two tumors); and we wish to simulate fMRI data in this scenario. Below, on the left we present dynamic 2-D intensity of simulated data for a slice (a cross-section) of the patient’s brain during 100 scans (with TR = 2); and on the right we present the dynamic 3-D image of threshold activities for the first 20 scans. Notice the two regions with intensive signals.

Please send me an email if you want the code for generating above dynamic 2D/3D plots.

For generating static 3D plot, there is a terrific article written by Muschelli et al.

For generating static 2D plot, there is a terrific article written by Tabelow et al. The data are here. Here is some further reading.

## On Generating Artificial Functional Magnetic Resonance Imaging (fMRI) Data (I): Generating Mechanism

Last updated: June 5, 2016

Understanding how the brain works is arguably one of the greatest scientific challenges of our time.

– Alivisatos et al

A useful approach to develop one’s skills in investigating human brain data is to understand incisively the data generation mechanism. At the low level, one may look at the biophysical basis of action potential, spike-timing dependent plasticity, neuronal dynamics, etc. I have little knowledge in these areas; but interested readers can refer to Bogacz, Lengyl, Pfister, amongst other great scientists working in these fields. Here, I aim to discuss the data generating mechanism at the voxel level, mainly, fMRI data. I have learned a tremendous amount from two wonderful papers: (1) Eloya et al; and (2) Welvaert et al. The post serves as a self-learning guide, as I selectively walk through a few key steps. Here is a useful reference of MRI glossary. I shall write another post in the near future discussing the importance of knowing the exact data generating mechanism on making “causal” claims, etc.

#### Part 1: Model Basics

First, fMRI data can be decomposed into two parts: (a) signals (experimental design caused activation, or resting state activation); and (b) noises. Namely,

$s(t) = h(t) + \epsilon(t),$

where $s(t), h(t),$ and $\epsilon(t)$ are signal response, activation function, and noise, respectively, at time $t$.

#### 1.1 Signals

The activation function $h(t)$ is convolved from $h(t) = f(t)*g(t) = \int_{-\infty}^{\infty} f(\tau) g (t-\tau) d\tau$, where we can treat $f(t)$ and $g(t)$ as a stimulas function (dependent upon experiemental design) and a “weight” function (a haemodynamic response function (HRF)), respectively. Then $h(t)$ is, intuitively, a weighted (by $g(t)$) average of $f(t)$ over the real line. For example:

There are a few choices of HRFs ($g(t)$), e.g., Gamma, double-Gamma, and the balloon model. Here, we take the Gamma function as an example. Consider the gamma-variate HRF function:

$g(t) = \frac{1}{k \tau_h (k-1)} \left( \frac{t}{\tau_h} \right)^k \exp{\{- \frac{t}{\tau_h}\}}$

with $k=3$ and $\tau_h = 0.242$;

Consider a block function $f(t)$ (corresponding to a block experimental design; and in the following figure, the black lined blocks), then the convolved activation function $h(t)$ using gamma-variate HRF function is shown below in dashed blue:

#### 1.2 Noises

There are four types of noises: (1) white noise, (2) colored noise, (3) temporal noise, and (4) spatial noise.

##### 1.2.1 White Noises
1. a. It is a system noise that is part of the fMRI data.
2. b. It could be Rician distributed (for fMRI data with low signal-to-noise ratio, SNR); or Gaussian distributed (higher SNR, > 10).
##### 1.2.2 Colored Noises
1. a. It depends upon the signal, the timing or the location
2. b. For signal-dependent colored noises, it could be (a) low-frequency drift (due to slow fuctuations in the scanner hardware; the drift is modeled as a basis of discrete cosine functions); (b) physiological noise (models the variability in the signal that is caused by cardiac and respiratory artefacts; the noise is modeled as sine and cosine functions); and (c) task-related noise (accounts for spontaneous neural activity due to the experimental task; modeled by adding random noise where and when activation is present; and it can be interpreted as residual noise from head motion that is not removed in the pre-processing stage).
##### 1.2.3 Temporal Noises
1. a. Due to the fact that fMRI data are repeated measurements;
2. b. It could be modeled as an $AR(p)$ model: $\epsilon(t) = \sum_{i=1}^p \rho_i \epsilon_{t-i} + \eta_i,$ where $\eta_i \sim N (0, \sigma^2)$.
##### 1.2.3 Spatial Noises
1. a. It models the spatial dependencies in fMRI data: adjacent voxels are more correlated than those that are far apart;
2. b . It could be modeled as (1) an AR correlation structure, (2) a Gaussian random field, and (3) a Gamma random field.

As an illustration, below on the left, we present a figure with task-based noise; and on the right, a 3-D view of a slide (20 by 20) of Gaussian random field spatial noises.

#### Part 2: Parameter Specification

Now that we understand the general mechanism of fMRI data generation, we need additional information such as when does the activation start for each experimental design, where does the activation occur in the brain, etc., in order to simulate real-life-like fMRI data. The additional information consists a few parameters:

• onset time;
• duration;
• number of scans;
• TR;
• hrf;
• effect size;
• SNR;
• noise type: low-frequency, physiological, task-related, mixture;
• spatial noise: AR correlation, Gaussian random field, Gamma random field;
• weights of each type of noise;
• noise type: gaussian, rician;
• the number of activated regions;
• the coordinates of the regions;
• the radius of the region;
• the shape of the region (cube and sphere); and
• other parameters

• Other useful packages includes:

• Package: oro.nifti
• write and read AFNI files;

## Get Started with SPM in Ten Steps

Last updated: July 18, 2016

￼My dull brain was wrought. With things forgotten.

– William Shakespeare

Many a friend of mine ask me about how to get started with Statistical Parametric Mapping (SPM) quickly. Being an R user primarily, I find myself incompetent to answer their question rigorously. However, as one of the most fundemental software in analyzing functional neuroimaging data, knowing how to use it - even moderately - has gained myself huge advantage in my research. Therefore, I present this post as a brief introductory guidance to readers who are new to SPM. Advanced readers may refer to SPM manual. Readers interested in knowning more about fMRI data analysis may also find the book Principles of fMRI and corresponding vedio lectures by Professors Lindquist and Wager, and Handbook of Functional MRI Data Analysis by Professor Poldrack et al, helpful.

Certainly, my experience with SPM is very limited; and likely I may present errors or omit important features of SPM in ten steps. I will update the following procedures regularly; if you have any suggestions, or criticisms, please send me an email to help me improve them.

Step 3: Open SPM in Matlab by entering spm in the command window. Based upon the data set you have (PET, EEG, or fMRI), click the corresponding icon (see Figure 1). Here, we choose fMRI, for example;

Step 4 (Display an image): First, download desired datasets, which are usually stored in  NIfTi (Neuroimaging Informatics Technology Initiative) files, saved in one *.nii file or a *.hdr and an *.img file. Functional images are saved as fM*.image; and structual images are saved as sM*.image. To display, click Display > file_location > *.nii > Done . We see the display of a structual image (see Figure 2).
Step 5 (Preprocessing): Preprocessing includes: motion correction; slice-time correction; coregistration; segmentation; normalization; and smoothing. These icons are in the top of the SPM menu (See the red frame in Figure 3). Let us take motion correction for example:  Realign > Realign (Est & Reg) > Double click Data <- X > Double click Session <- X   > in the pop-up window choose desired file to be motion corrected > Done > click the green bottom  (See Figure 4). If inputing files are sM*.image (similar for functional images), resulting files are located in the original filepath as meansM*.image and rsM*.image. Similarily, for slice-timing, the input image is rsM*.img and the output image is arsM*.img; for coregistration, the input images are reference image meansM*.img and source image sM*.img, while the output image is updated “source image”; for segmentation, input image is sM*.img and the output images are c1sMM*.img, c2sMM*.img, and sM*_seg_sn.img; for normalization, the input images are sM*.img and arsM*.img, and the output image is warsM*.img; and finally, for smoothing, the input images are warsM*.img and swarsM*.img.

## Sir R. A. Fisher on Brain Science: a Neo-fictional Interpretation

Last updated: June 1, 2016

Statistics may be regarded (i.) as the study of populations, (ii.) as the study of variation, (iii.) as the study of methods of the reduction of data.

– Sir R. A. Fisher, Statistical Methods for Research Workers

June 1, 1919.

My name is Ronald Aylmer Fisher. I am a researcher at Rothamsted Experimental Station (the director here, Dr. E. John Russell, calls it Rothamsted Research for short) in Harpenden, a small town in the St Albans City district in the county of Hertfordshire, England. Here at Rothamsted Research, I analyze data obtained from crop experiments during the day. At night, I am working on a few manuscripts on experimental design and gathering several statistical methods to analyse variance in field studies. Two weeks ago, I paid a trip back to Cambridge and had a long discussion with my old boys John Keynes, Reginald Punnett, and Horace Darwin, and have become very interested in studying data collected from the human brain. I would like to present a few thoughts sprung from our conversation.

I claim that the usefulness of statistics in neuroscience can be divided into three areas: the study of the brain in populations, (ii.) the study of variation of brain measurements in populations, and (iii.) the study of methods of the reduction of large brain data. I shall further expatiate these three areas by including statistical approaches with regards to data science development in neuroscience.

First, the study of brain data is to gain insights to understanding how the brain perceives, processes, stores, and output information, in populations, or aggregates of individuals, rather than of individuals. The term population in brain science refers not only to an aggregate of brain activity measurements from multiple subjects, but also to an aggregate of a single brain measurement repeated multiple times for one subject. The former indicates our recognition of variations of brain activities amongst different individuals, whereas the latter represents our appreciation that the object of studying single subject brain activities is not to attempt to achieve an individual result, but rather, we make our best effort to ensure our findings representative. There are significant merits in studying data containing measurements of multiple subjects and those containing multiple measurements of single subjects. One of the end goals of brain science is to make scientific progress on diagnostics, treatments, cures and management of brain disorders. In order to raise the findings we have about the brain to the rank of science, we shall make statistical arguments about properties of the brain in a large aggregates of individuals. In order to produce treatments that target at a particular individual, we shall make statistical arguments about properties of the brain for that individual, based upon a large aggregates of measurements of his/her brain. Understanding how the brain works across subjects allows us to apply these principles at the individual level, and to advance applications that achieve artificial intelligence by mimicing the way an average brain performs, such as neural networks computers. Understanding how the brain works at the individual level would assist us in understanding how a specific brain and its activities deviate from the average. It hence leads to scientific progress such as an introduction of personalized medical plans, and a usage of brain signals to identify a subject (See Constable et al and Wachinger et al, among others). With an advancement of data acquisition technology and the popularization of high-performance computers, we are obtaining brain data in an unprecedentedly high-resolution, rapid, and accurate manner. Yet, there are strides to make. We shall advance our understanding of how the brain works in different types of populations: infants V.S. adults, females V.S. males, etc., how the brain signals change across time, and how brain signals change according to different (visual, auditory, sensory, etc.) inputs. Furthermore, we shall reduce the errors caused by measurement and data processing, via improving and developing proper statistical and computing techniques. Additionally, we shall aim to increase the sensitivity of our study. It allows pharmaceutical companies to develop affordable medicine that would treat specific brain disorders for the marjority of patients.

Second, the brain is an extremely complicated organ stored in a blackbox. Despite the advance in brain science, little do we know about how information is processed in the box. For example, does the brain process information linearly, or more plausibly, non-linearly (but in which form)?; (b) there is a tremendous amount of variations amongst different brains in terms of sizes, volumes, shapes, etc; and (c) there is much variation in measuring brain signals. Whilist the first challenge is extensively tackled by physicists and computer scientists via the studying of dynamic systems, spiking neural networks, and other neural networks (e.g. recurrent neural networks; Boltzmann neural network; deep neural networks; adaptive neural networks; radial basis networks), statisticians working on neuroscience are actively seeking to solve the latter two. Once we have identified our goal in studying the brain in populations, it is a natural follow-up to study variations because the brains in populations display vairation in one of more aspects. We, nevertheless, are not interested in variation of the brain per se; rather we recognize that variation is an inevitably troublesome by-product delineating circumstances where repeated measurements of the brain deviate from the average. Therefore, while describing the absolute properties of the brain (via parameters, e.g. mean activation intensity of a region of the brain), we encompass them with variances to address their uncertainty (and confidence). The introduction of variances leads to two further areas of statistical studies in brain science: the study of frequency distributions, and the study of correlations. The frequency distribution may be expressed as a mathematical function of the variate (e.g. voxel-specific t-statistic), either (i.) the proportion of the population (regions, voxels, neurons, etc.) for which the variate is less than a given value, or (ii.) by differentiating this function, the infinitesimal proportion of the population for which the variate falls within any infinitesimal element of its range. In brain science, many frequency distributions are heavy-tailed - hence the study of them has some implication in studying certain financial models), and vice versa. On the other hand, we are not only interested in studying the variations of the parameters of interests at present, but also interested in estimating the quality and types of these variations. Especially, we are interested in examining the simultaneous variation among multiple variates. It, therefore, gives rise to the correlation analysis. For ultra-highdimensional brain data, however, a voxel-wise correlation analysis could be unmanageably troublesome. This leads to the following section.

The third usefulness of statistics in brain science is due to the practical need of reducing large bulks of data to a convenient amount that retains relevant information in the original data that our human minds (and our computer memories) are able to grasp, by means of a manageable anount of numerical values. How much data reduction, however, should we conduct? In all cases, it is possible to reduce data to a simple numerical form, or to an amount that our computers are able to efficiently handle, where, the reduced data are sufficient to shed light upon scientific questions the investigator has original in mind. In brain science, two widely-used approaches in conducting data reduction are (I) principal component analysis (PCA) and its variants and (II) assuming sparcity. The PCA method captures the marjority of the variation of the data. However, in brain science, oftentimes we have data with p ~ n or p » n, under which its estimates are inconsistent. There are a few papers on sparse PCA that have demonstrated subspace consistency. For example, Ma et al and Jung et al. The sparcity assumption is, in essense, to make p < n. It makes neurobiological sense in the following manner: the working brain consumes energy. At any given time, be it resting state or task state, only a portion of the neurons are activated so as to reserve energy. I had an amiable conversation with Professor Pien-Chien Huang, during which he mentioned that we human-beings do not dream in color (or at least have dreams less vivid and coloful). Schwitzgebel et al has an article discussing this. I conjecture (with absence of scientific evidence) that a part of the reason is the brain attempts to reserve energy while sleeping (so only the minimal amount of information is processed: e.g. the brain only recalls the outlines, orientations, movements, etc. of objects. But they are sufficient to distiguish one from another and form visual events) - statistically a natural way of conducting data reduction! Much of the exisiting and on-going projects with regards to brain decoding, such as facial recognition and dream decoding, (see decoding simple pictures: Haxby et al, decoding objects with edges and orientations: Haynes and Rees and Kamitani and Tong , decoding complex picutres: Gallant et al , decoding movies: Gallant et al , decoding intentions: Haynes et al , and decoding dreams: Kamitani et al ) are mainly conducted using one or more of the following approaches: PCA, LASSO (sparcity assumption), and Bayesian analysis.

## On Scientific Writing and Presentation

Last updated: May 1, 2016

If you would not be forgotten as soon as you are dead, either write something worth reading or do things worth writing.

- Benjamin Franklin

I have been long thinking about writing a blog that puts in order a few resources from which I have benefited immensely in my yet-immature-but-steadily-improving academic writing and presenting skills. On one hand, this project serves as a library to which I have a convenient access; on the other hand, I hope that some students like myself can benefit from reading these articles, essays, and blogs below. This is an on-going project; and the choice of resources purely subjective. Nonetheless, if you have any comment, or if you would like to recommend additional resources, please send me an email. Please check back every now and then as I update.

This aritical is recommended by Professor Andreas Buja as “an article everybody should read” on scientific reading. Here is a summary version of it.

The Elements of Style teaches concise and vigorous English writing.

Professor Jeff Leek gives good advice on effective presentation here.

As a student studying in the interdisciplinary field between statistics, computer science, and neurosicence, I have learned a great deal from reading the works of Sir R. A. Fisher’s, inter alia, the “Statistical Trilogy” : (1) Statisticcal Methods for Research Workers; (2) The Design of Experiments; and (3) Statistical Methods and Scientific Inference

In the Autobiography of Benjamin Franklin, Mr. Franklin taught us a few procedures to improve writing:

• Write short hints about each sentence (you could also outline the piece) and set it aside for awhile
• Rewrite the article in his own words
• Compare with the original, and
• Revise and improve your essay.

• Andrew Lee, MD, my former roommate, an ENT resident at the Johns Hopkins Hospital, and an excellent writer, suggests that the most effective thing to do to improve one’s writing skills is to read essays; and read them critically. “By critically”, Andrew elaborates, “I mean while reading, think why does the author choose that particular word, and what does that word do?” To highlight the importance of effective writing, I once asked Andrew to read an email of mine in response to a job application, which seemed to be on the verge of a declination at that time. He rewrote the email and got me the job. He did not change the context of the communication; but he carefully changed a few words, which improved the presentation drastically. Here is a list of resources he suggests to read:

• The Atlantic
• The Economist
• The New York Times Editorials
• Great Expectation, by Charles Dickens
• Crime and Punishment, by Fyodor Dostoevsky
• Catch 22, by Joseph Heller
• Wuthering Heights, by Emily Bronte
• Heart of Darkness, by Joseph Conrad
• Beloved, by Toni Morrison
• The Sound and the Fury, by William Faulkner

• Beautiful fonts can be found here.