Gillespie algorithm for stochastic simulations of signaling pathways – vectorization in MATLAB

Modeling of signaling pathways is an important part of cancer research, as it is essential to understand how proteins interact with each other to provide or impair a specific cell function. There are two main approaches to tackle that problem mathematically: 1) deterministic approach, in which we assume that the number of molecules is large and stochastic effects are negligible, 2) stochastic approach (typically using Gillespie algorithm), in which we model exact number of molecules and take into account inherent system noise. Today I’ll focus on the stochastic approach. You can find online plenty of packages to perform the stochastic simulations using Gillespie algorithm, but I haven’t seen any that is harnessing the vectorization in MATLAB. Today I will show that using vectorization we can increase the speed of the simulations about 100 fold. Such an increase in computational speed is very important, as more and more frequently agent based models incorporate signaling pathways on the cell level.

Let us start with basic information about the Gillespie algorithm. Using the algorithm we describe the change in the number of molecules from K different species (x1(t),…,xK(t)) that at each moment can undergo one of L different biochemical reactions. Each reaction has a specific propensity function, e.g. biding of two molecules from different species can have propensity r*x1(t)*x2(t) where r is biding rate. The propensities determine the time to next reaction (exponentially distributed) and which reaction occurs next (discrete distribution with propensity based probabilites). After a reaction occurs the system is updated according to stoichiometric matrix, i.e. matrix describing how many reactants are removed from the system and what products are created.

As a working example we will consider a very basic gene expression pathway – we will describe the number of mRNAs and corresponding proteins. Four considered reactions will be: transcription, translation, mRNA degradation, and protein degradation. Ok, let us start coding by defining considered pathway in the main script.

%DEFINING REACTION RATES
par.kr = 1/10;     %transcription rate
par.kp = 10;   %translation rate
par.gr = 1/150; %mRNA degradation rate
par.gp = 1/30;  %protein degradation rate

%DEFINING PROPENSITY FUNCTIONS AS A FUNCTION HANDLE
prop = @(x,par)([par.kr,...      %transcription, one mRNA molecule created
                 par.kp*x(1),... %translation, one protein created
                 par.gr*x(1),... %mRNA degradation, one mRNA molecule removed
                 par.gp*x(2)]);  %protein degradation, one protein molecule removed

%DEFINING INITIAL CONDITION, order [mRNA, Protein]
init = [0;1];

%DEFINING STOICHIOMETRIC MATRIX
%column corresponds to the reaction, row corresponds to the molecule
%order as in prop and init variables
stoch = [1 0 -1 0;... 0 1 0 -1];

%DEFINING TIME MESH FOR THE OUTPUT TRAJECTORY
tmesh = linspace(0,1000,100) ;

Now in a few lies we can code the whole Gillespie algorithm in a separate function file (without using any vectorization techniques first).

function trajectory = SSA(tmesh, par,prop,stoch, init )
   %tmesh - time mesh on which solution should be returned
   %per - parameters of the pathway
   %prop - definition of propensity functions
   %stoch - stochiometric matrix
   %init - initial condition for the pathway

   t = 0;                                          %current time
   state = init(:);                                %variable with current system state
   trajectory = zeros(length(init),length(tmesh)); %preparing output trajectory
   trajectory(:,1) = init(:);                      %setting initial value as the first element in trajectory
   cindx = 2;                                      %current trajectory index
   N = length(tmesh);                              %number of time points

   while t<tmesh(end)
      Q = feval(prop,state,par);          %calculating propensities of the reactions
      Qs = sum(Q);                        %total propensity
      dt = -log(rand())/Qs;               %generating time to the next reaction
      R = sum(rand >= cumsum([0 Q])/Qs);  %selecting reaction
      state = state + stoch(:,R);         %updating state
      t = t + dt;                         %updating time

      %writing the output
      while cindx<=N && t>tmesh(cindx)
         trajectory(:,cindx) = state;
         cindx = cindx+1;
     end
   end
end

and add the following lines in the main script to simulate and plot trajectory.

%simulating
traj = SSA(tmesh, par,prop,stoch, init );

%plotting
figure(1)
plot(tmesh,traj(1,:))
xlabel('Time, min')
ylabel('#mRNAs')

figure(2)
plot(tmesh,traj(2,:),'r')
xlabel('Time, min')
ylabel('#Proteins')

Below are the resulting plots after one stochastic realization of the whole pathway.

mRNA protein

So now we can simulate our simple pathway, but there is one problem: it takes about 3.5 seconds to generate trajectory presented in the above plot. If we want to simulate that pathway for each agent present in the system and we reach the level of 2000 agents, then simulating the pathway for each of the agents using simple for loop will take about 2 hours… What can we do to decrease computational time? We could of course use multiple CPUs – using 100 of them will allow to do the computations in about a minute. But who has access to 100 CPUs? Only few of us. What about using the vectorization option in MATLAB and modifying the SSA function to perform those 2000 simulations simultaneously? That is definitely worth trying. We need only to add additional argument to SSAv function – number of trajectories to generate, and modify few lines to do vectorized calculations:

function trajectory = SSAv(tmesh, par,prop,stoch, init, nSim )
   %tmesh - time mesh on which solution should be returned
   %par - parameters of the pathway
   %prop - definition of propensity functions
   %stoch - stochiometric matrix
   %init - initial condition for the pathway
   %nSim - number of simulations to perform

   tmesh = tmesh(:);   %reshaping mesh to be vertical vector
   t = zeros(nSim,1);  %current time for each simulation
   state = repmat(init(:)',nSim,1); %current state variable for each simulation
   trajectory = zeros(nSim,length(init),length(tmesh)); %preparing output trajectory
   trajectory(:,:,1) = state;%setting initial value as the first element in trajectory
   cindx = 2*ones(nSim,1);%current trajectory indices
   N = length(tmesh); %number of time points
   aux = 1:nSim; %

   while ~isempty(t)
      Q = feval(prop,state,par);         %calculating propensities of the reactions
      Qs = sum(Q,2);                     %total propensities
      dt = -log(rand(size(Qs,1),1))./Qs; %generating time to the next reaction
      P = bsxfun(@rdivide, cumsum([zeros(size(Qs,1),1) Q],2),Qs); %probabilities for each reaction
      R = sum(bsxfun(@ge,rand(size(Qs,1),1),P),2);                %selecting reaction
      state = state + stoch(:,R)';       %updating state
      t = t + dt;                        %updating time

     %writing the output
     update = t > tmesh(cindx);
     while any(update)
        %updating state
        iupdt = find(update);
        for i = 1:length(iupdt)
           trajectory(aux(iupdt(i)),:,cindx(iupdt(i))) = state(iupdt(i),:);
        end
        cindx = cindx+update;

        %removing finished simulations from the variables
        indx = cindx > N;
        if any(indx)
           cindx(indx) = [];
           t(indx) = [];
           aux(indx) = [];
           state(indx,:) = [];
           if isempty(cindx)
              break;
           end
        end
        update = t > tmesh(cindx);
     end
   end
end

We need also to make sure that in the main script that the propensity functions are defined in the vectorized manner (everything else remains the same).

%DEFINING PROPENSITY FUNCTIONS AS A FUNCTION HANDLE IN VECTORIZED MANNER
prop = @(x,par)([par.kr*ones(size(x,1),1),...
                 par.kp*x(:,1),...
                 par.gr*x(:,1),...
                 par.gp*x(:,2)]);

Below are the plots of trajectories generated using the vectorized function for nSim = 100.

mRNA100 protein100

Now the important question is how much time did it take to simulate those 100 trajectories. We know that using the simple non-vectorized code and a for loop it would take about 360 seconds. The vectorized function did it in 16 seconds! That is 22 fold decrease in computational time. Plot below shows how computational time changes with the number of trajectories to simulate.

comparizonTimes

Vectorized code generated 2000 trajectories in 62 seconds! So we don’t need those 100 CPUs to obtain reasonable time of computations.

I hope that this post will convince some people that MATLAB is not bad after all.