September, 2020 | Coding for science

Matlab + Fminuit を使った非線形関数フィッティング

English version is here.

このブログで使われているプログラムのダウンロード・更新についてのガイドはこちら。

データを眺めた後でモデルを構築してそれがデータを説明していることを確かめる、というのはデータ解析の一つの理想的な終わり方のような気がします。しかし実際にやってみると作業が多くてなかなか骨が折れます。やる事はだいたい決まっているのに1つのツールではなかなかできず、プログラムも冗長になりがちです。今回の記事では非線形関数フィッティングをした上で更にその適合度(goodness-of-fit)を簡単に計算するツールを紹介します。

作ったモデルがデータに適合しているか確認するには、まず数式を用いてモデルを表現し、(多くの場合非線形の)その式をデータにフィッティングし、適合度の検定をします。以前 Curve Fitting Toolbox を持っていなかったときにこれらの作業をするのはとても困難でした。素のMatlabで非線形関数フィッティングに使える関数は ‘fminsearch’ だけでした、’fminsearch’はフィットしたパラメータの誤差の解析を行わないので、パラメータの解釈もそのパラメータに対する検定等も面倒になります。

なんとか非線形関数フィットが便利にできる関数がないものか探したところ、Fminuitというツールを見つけました。FminuitはCERNで開発されたMinuitをMatlab用にラッピングしたものです。詳しい説明は本家のページに譲りますが、かなり強力なツールです。このプログラムは数十年にわたってCERNでのデータ解析に活発に使われており、毎年ペタバイト単位で集まってくるデータを捌くのに使われています。Fortranを基礎とするコードは堅牢かつ高効率かつ多機能で、それでいてシンプルなコマンドで柔軟に使うことができます。本家でWindowsとLinux用のバイナリを配布しているので、私のGithubではmacOS用(Intel Mac)のみを置いておきました。Mac以外を使っている方はバイナリを本家からダウンロードしてくるか、コードを落としてきてコンパイルし、Matlabのパスに入れてください。

もちろん Fminuit の時点で既に強力なツールなのですが、以前紹介した ErrorNum クラスを用いて更に使い易くしようと試みました。’MinuitFitEN’ という関数を書いたのですが、実例を見る前に引数の確認のためヘルプを見てみましょう。

function res = MinuitFitEN(func, initp, x, y, varargin)
 
     func - fit funciton handle func(p, x) % 関数ハンドルを引数として渡します。
     initp - (nPar vec) estimated inital parameters % パラメータの初期値。
     x - (any) 2nd argument of func % func の2番目の引数に該当するものをここに入れます。
     y - (any) ErrorNum data to fit including error % フィット対象のデータを ErrorNum にして入れます。
 
     -- from here optional parameters (default) % オプション群です。括弧内はデフォルト値です。
     'Extradim' - (scalar) extra degree of freedom to subtract (0) % もし束縛条件などがあって、余計に引いておきたい自由度がある場合使います。
     'LowerLimit' - (nPar vec) lower bound of the parameters (false) % パラメータの下限値がある場合に使います。
     'UpperLimit' - (nPar vec) upper bound of the parameters (false) % 上限値も同様です。下限を設定したら上限も設定してください。
     'StepSize' - (nPar vec) step sizes of the parameters (false) % フィッティングのステップサイズを指定したい場合に使います。
     'minfunc' - (String) function to minimize ('PoissonNLL') % 最小化する関数です。デフォルトはポアソン分布です。カイ二乗フィットをするときは'Chi2'と書いてください。
     'command' - (String) command to give to fminuit % fminuitに渡すコマンドを指定します。minuitのマニュアルを参照してください。
                 ('set pri -1; scan; minimize; simplex; improve; minos')

この関数を使うと非線形関数フィットが適合度検定も含めて比較的簡単にできるようになります。まずはフィットするデータを準備します。

x = 0:0.2:3;
n_obs = 30; % 測定の回数に該当するものです
sigma = 5; % データの標準偏差

% フィッティングを用いて以下のパラメータを推定します
base = 2; % 指数関数の底
offset = 3; % y軸のオフセット

% データを構築します
% 最後の項で測定誤差を加えています。
y_obs = repmat(base.^x, n_obs, 1) + offset + sigma * randn(n_obs, length(x));
y_err = ErrorNum.create(y_obs, 1); % ErrorNum.create は 標準誤差を計算します。

figure(101);
plotEN(x, y_err, '.', 'capsize', 0); % 最後に EN とつく関数は ErrorNum 用の関数です。

これを実行すると、このようなデータが生成されます。だいたい指数関数ですが、測定誤差(今回のは人為的に入れられたものですが)のせいで値が上下にふらついています。これに対して関数フィッティングをやってみましょう。

powfunc = @(p, x) p(1).^x + p(2); % これがモデル関数です。

% パラメータに適当な初期値を与えます。真値に近いほどフィッティングが成功しやすくなります。
base_guess = 3;
offset_guess = 1;
initial_p = [base_guess, offset_guess];
fitresult = MinuitFitEN(powfunc, initial_p, x, y_err, 'minfunc', 'Chi2');

最初の行で無名関数を作ってハンドルを手に入れています。もしモデル関数用の.mファイルが既にある場合第一引数を ‘@関数名’ としてハンドルを渡してください。フィッティングに必要なコードはこれだけです。 MinuitFitEN の最後の引数2つはフィッティングのタイプ(最小化する量)を示しています。今回は誤差がガウス分布なのを想定しているので、カイ二乗フィット’Chi2’を選択しています。私が個人的にニューロンの発火数に対する解析をよくやるので、デフォルトの設定がポアソン尤度関数’PoissonNLL’になっていますが、ANOVA (分散分析)が適用可能な解析の場合は’Chi2’の方を選択してください。(‘Chi2’の方がデフォルトとして適切な気もするのでこちらをデフォルトに変更するかもしれません。)

さて、フィッティングの結果を見てみましょう。

>> fitresult.p
ans = 
ErrorNum:
2.07 ± 0.06,  2.8 ± 0.3

MinuitFitEN は、推定されたパラメータ(p)を含む構造体を返します(誤差もちゃんと計算されています)。この誤差はANOVAにおける 1 標準偏差なので、だいたい68%の確率で真値がこの誤差の範囲に入っているはずです。さきほど設定した底の値は 2 なので、少しだけ外れています。オフセットは 3 なので誤差の範囲に入っています。どうやらちゃんと動いているようです。パラメータ推定をするときにいつも誤差を計算してくれるのは安心感があります。出力が ErrorNum なので、桁数の調整なども必要なく、適切な有効数字で返してくれます。

さて、先ほども述べた通り MinuitFitEN は関数の適合度も計算してくれます。

>> fitresult.redchi2
ans =
    1.5140

>> fitresult.redchi2sig
ans =
    0.9033

‘redchi2’ は reduced chi-squared の略で、カイ二乗を自由度で割った値になります。詳しい説明はWikipediaに任せますが(英語版になります。日本語の説明はありませんでした…。)、この値が1に近ければフィッティングがうまくいっているということになります。小さすぎるときはオーバーフィット(あるいは誤差の過大評価)、逆に大きすぎるときはアンダーフィット(あるいは誤差の過小評価)していると解釈できます。

1に近いといっても、どのくらい近ければ許されるのか明確ではないので、このカイ二乗の値をカイ二乗分布と比較して、どの程度稀なオーバーフィットやアンダーフィットなのかを調べられるようにしたのが ‘redchi2sig’ の値になります。これは適合度を評価する際のp値のようなものです。仮にこの値が0.99の場合は、フィッティングがうまくいっていると仮定して1%程度の確率でしか得られない値になっている(したがって、アンダーフィット(モデリングがうまくいっていない)の可能性が高い)、というように解釈します。モデリングがうまくいっていることを確かめるためにはこの値があまり極端な値をとらないことを示すことが必要になります(状況にもよりますが、0.05 から 0.95 あたりの間に入っているのが目安かと思います。p値を使うときと同じような注意が必要になります。)

これらの情報はフィッティングがうまくいっていることを確認するために大変有用な情報で、私自身も過去に書いた論文(Ito et al., 2017; Ito et al., 2020)で積極的に使用しています。しかし、敷居の高さもあってか一部の学問領域を除いて浸透しているとは言い難い状況です。これから書く予定の記事の中でも、これらの統計は使用していくことになると思いますので、そのときにまた具体例から適切な活用方法を掴んでいただけると幸いです。例に漏れず、これらの関数も隠れた高度な使い方がたくさんありますので、それらも紹介していけると嬉しいです。もし使ってくださる奇特な方がいらっしゃる場合は、それまでヘルプを参照して使用方法を確認していただけると有り難いです。

Nonlinear function fitting in Matlab using Fminuit

Japanese version is here.

Please follow this instruction to get my software on your computer.

A good ending of data analysis is making a model that explains your data. Typically it requires nonlinear function fitting. In Matlab, unless you have toolboxes, you will be left with ‘fminsearch’ function with limited capability. It is nice if you can do more sophisticated nonlinear fitting without a toolbox. In this post, I’ll introduce you to a very nice external function, ‘fminuit’, that does exactly that.

Fminuit is based on CERN‘s function fitting library Minuit. You can see a full explanation of what Fminuit is on its website. Briefly, Minuit is a program that is used in CERN for decades in order to fit high-energy physics models to petabytes of data collected there. It is very robust, efficient, feature-rich, and highly controllable. Fminuit is a Matlab wrapper of Minuit. My git repository includes only a binary version of fminuit for 64-bit Intel Macs. Windows or Linux binary are available from the Fminuit page, so please download and put them to your Matlab path. (You can also compile it.)

Fminuit itself is already pretty strong software, but I further wrapped it by using my ErrorNum class to make it even more user-friendly. Let’s take a look at the beginning of the help of my ‘MinuitFitEN’ function.

  function res = MinuitFitEN(func, initp, x, y, varargin)
 
     func - fit funciton handle func(p, x)
     initp - (nPar vec) estimated inital parameters
     x - (any) 2nd argument of func
     y - (any) ErrorNum data to fit including error
 
     -- from here optional parameters (default)
     'Extradim' - (scalar) extra degree of freedom to subtract (0)
     'LowerLimit' - (nPar vec) lower bound of the parameters (false)
     'UpperLimit' - (nPar vec) upper bound of the parameters (false)
     'StepSize' - (nPar vec) step sizes of the parameters (false)
     'minfunc' - (String) function to minimize ('PoissonNLL')
     'command' - (String) command to give to fminuit
                 ('set pri -1; scan; minimize; simplex; improve; minos')

You can run a nonlinear function fit by writing as simple as the following example. First, let’s construct the data to fit.

x = 0:0.2:3;
n_obs = 30; % number of observation
sigma = 5; % STD of the data

% Through the fit, you need to estimate these variables
base = 2; % base of the power function
offset = 3; % offset from 0

% Construct your data based on the parameters above
% Note the last term is adding errors in observations.
y_obs = repmat(base.^x, n_obs, 1) + offset + sigma * randn(n_obs, length(x));
y_err = ErrorNum.create(y_obs, 1); % ErrorNum.create calcualtes SEM

figure(101);
plotEN(x, y_err, '.', 'capsize', 0); % EN functions are for the ErrorNum class.

This yields a figure like this. As expected, you get data points that roughly obey the power function, with some fluctuations. Now, let’s fit a function to it.

powfunc = @(p, x) p(1).^x + p(2); % this is your model function to fit the data.

% you need to guess the initial parameters
base_guess = 3;
offset_guess = 1;
initial_p = [base_guess, offset_guess];
fitresult = MinuitFitEN(powfunc, initial_p, x, y_err, 'minfunc', 'Chi2');

In the first line, I’m defining an anonymous function as my model. If you have a .m file, you can get the handle by doing something like ‘@function_name’. Now fitting is done. The last two arguments for MinuitFitEN indicate the type of fitting (or a function to minimize). This time, I am doing a chi-square fit, so I’m choosing ‘Chi2’. As I’m doing a lot of analysis on spikes of neurons, the default is set to ‘PoissonNLL’, or Negative Log-Likelihood function for the Poisson distribution. In applications where ANOVA works (like this example), ‘Chi2’ should be the choice.

Let’s take a look at the result of parameter estimation.

>> fitresult.p
ans = 
ErrorNum:
2.07 ± 0.06,  2.8 ± 0.3

The output of the MinuitFitEN function is a structure that contains the estimated parameters with errors. The errors are one sigma of ANOVA, so you have a ~68% chance of having your true value within this error. The first value should be 2, but it’s slightly off. The second value is nicely within our expectation, 3. The fit seems to be working! It is very nice to always have the error attached to your results. The output is again, an instance of the ErrorNum class, so we get a reasonable text output without adjusting anything.

MinuitFitEN also calculates goodness-of-fit statistics. They are stored in these variables.

>> fitresult.redchi2
ans =
    1.5140

>> fitresult.redchi2sig
ans =
    0.9033

‘redchi2’ stands for ‘reduced chi-squared’, I wouldn’t give a detailed explanation of this value. Please refer to Wikipedia for its definition and usage. Here, I just say that if this value is close to 1, you are doing a reasonable fitting. If it is too small, you are overfitting (or overestimating the error); if it is too large, you are underfitting (or underestimating the error).

‘redchi2sig’ gives you the significance of the goodness of your fit. It is like a p-value for the goodness-of-fit. For example, if it is larger than 0.99, you have a reduced chi-squared value that a perfect fit would encounter only 1% of the chance. To say that your model explained the data well, you want to have a reasonable value (e.g. 0.05 < redchi2sig < 0.95).

These are important statistics, and I extensively used these values in the last two papers (Ito et al., 2017; Ito et al., 2020) that I have written. (These papers have real-world examples of function fitting.) However, these are also difficult concepts to grasp if you are a beginner of function fitting. In future posts, I will demonstrate more cases to show how to use these values. I will also demonstrate more advanced usage of this function (such as setting parameter boundaries, using more complicated functions, etc…), but for now, please refer to its help to see how to use it.

誤差伝搬・解析を簡単にする ErrorNum クラス(発展編)

An English version is here.

前回の記事では ErrorNum クラスの基本的な使い方を解説しましたが、今回は一歩進んだ内容を解説します。内容は次の3点です。

共分散がある変数同士の演算
値の統計的有意性の検定
任意の関数による誤差の伝搬

共分散がある変数同士の演算

2つの変数が独立でない場合、誤差に相関が生じることがあり、単純な誤差の伝搬をすると適切な結果を得られないことがあります。具体的には次の例のような場合です。

>> x = 4 + randn(100, 1);
>> y = x + 0.5 * randn(100, 1);
>> figure;plot(x, y, '.');
>> axis([0, 8, 0, 8]);

この結果を図にすると、

となりますが。一つの点を一つの測定値とすると、xの誤差とyの誤差が相関している様子がわかると思います。xが大きいときはyも大きく、xが小さいときはyも小さくなります。このような場合、次の図からもわかる通り相関によって x + y の誤差は x – y の誤差よりも大きくなります。

この共分散を考慮に入れずに計算した場合、誤差の計算は誤ったものになってしまいます。

>> x_en = ErrorNum.create(x, 1);
>> y_en = ErrorNum.create(y, 1);

>> x_en + y_en % ErrorNum は単純な二乗和平方根を計算します。
ans = 
ErrorNum:
8.3 ± 1.5

>> x_en - y_en % こちらの誤差の方が小さいはずですが…
ans = 
ErrorNum:
0.0 ± 1.5

これを正しくするにはあらかじめ計算した共分散の値を ErrorNum の計算に教えてやります。

% 共分散を計算してそれを考慮した加算(covadd)をさせる
>> cov_xy = cov(x, y); % cov は分散"行列"を返すので、cov_xy(1, 2) か (2, 1) 要素を使います。

>> x_en.covadd(y_en, cov_xy(1, 2))
ans = 
ErrorNum:
8 ± 2

>> x_en.covsub(y_en, cov_xy(1, 2))
ans = 
ErrorNum:
0.0 ± 0.4

x – y の誤差が 1.5 から 0.4 に減少しています。上で y の値を作るときに x に 0.5 * randn を加えて構築したので、x – y が 0.5 に近い誤差を持つ、というのは整合性がとれています。

値の統計的有意性の検定

値を計算しても多くの場合は、統計的有意性の確認をしないと意味を持ちません。ErrorNum クラスは ANOVA (analysis of variance) ベースの簡単な統計テストをしてくれます。

%% ANOVA テスト (0 との 比較をし、p値を返します。)
>> a = ErrorNum(3, 1);
>> b = ErrorNum(2, 1);

>> a.sig % 最も簡単な記法
ans =
    0.0027

>> sig(a) % これもいけます
ans =
    0.0027

>> sig(a - b) % これもいけますが…
ans =
    0.4795

>> (a - b).sig % この記法は使えません。
Error: File: ErrorNum_demo_advanced.m Line: 34 Column: 8
Invalid use of operator.

Matlab でメソッドを使う記法はドットを使うものと、関数のように書く方法の2つがありますが、演算した結果に適用したい場合、ドットを使う方法はエラーを返します。これは ErrorNum の仕様というよりは、 Matlab 自体のクラスの扱いかたによるものだと思います。

任意の関数による誤差の伝搬

四則演算などの単純な計算なら誤差の伝搬も簡単ですが、他の演算(関数など)に誤差の伝搬を適用したい場合はより複雑な計算が必要になります。完全とは言えませんが、ErrorNumはそのような事例にも対応しています。次の例を見てください。

>> x = ErrorNum([1, 2, 3], 0.1 * ones(1, 3))
ErrorNum:
1.00 ± 0.10,  2.00 ± 0.10,  3.00 ± 0.10

>> y = x.convert(@log, @(t) 1./t)
ErrorNum:
0.00 ± 0.10,  0.69 ± 0.05,  1.10 ± 0.03

convert の第一引数は適用したい関数のハンドルで、第二引数はその導関数のハンドルです。ほとんどの Matlab 関数は ErrorNum には適用できませんが、このように導関数を与えてやることで誤差の伝搬も考慮した計算が可能になります。(xの値に応じてyの変分も小さくなるので誤差が縮んでいるのがわかると思います。) 誤差の伝搬は線形近似を使って行われているので、誤差が大きくなって関数の非線形性が無視できなくなる場合この方法はあまり好ましくありません。

誤差伝搬・解析を簡単にする ErrorNum クラス(基礎編)

An English version is here.

コードのダウンロード・更新についてのガイドはこちら。

データ解析において誤差の計算や適切な使用は大変重要ですが、計算する全ての値についてきちんと誤差の伝搬を考えてコードを書くのは面倒でもあります。ErrorNumクラスは、これらのことをほとんど意識させることなく計算してくれます。早速使い方を見てみましょう。

% 誤差を含む値が2つ(en1, en2)用意します。
>> en1 = ErrorNum(5, 0.3)
en1 = 
ErrorNum:
5.0 ± 0.3
>> en2 = ErrorNum(2, 0.2)
en2 = 
ErrorNum:
2.0 ± 0.2
% ErrorNum同士の四則演算では、誤差が正しく伝搬されます。
>> en1 + en2
ans = 
ErrorNum:
7.0 ± 0.4
>> en1 * en2
ans = 
ErrorNum:
10.0 ± 1.2
>> en1 / en2
ans = 
ErrorNum:
2.5 ± 0.3

以上の計算は、クラス内部で四則演算の際の誤差の伝搬ルール(日本語版がみつかりませんでした…)を用いて行われています。計算結果は、有効数字が何桁になるかを誤差から判断して適切な桁数を表示します。(有効数字を無視して結果を過剰な精度で記述してしまうのは初学者にありがちなミスです。)

% 誤差に応じて適切な桁数を報告します。
>> ErrorNum(1.2531, 0.2)
ans = 
ErrorNum:
1.3 ± 0.2
>> ErrorNum(1.2531, 0.02)
ans = 
ErrorNum:
1.25 ± 0.02
>> ErrorNum(1.2531, 0.002)
ans = 
ErrorNum:
1.253 ± 0.002

まだ実装されているのは少数ですが、 sum や mean などの基本的な関数もErrorNumに対して使うことができます。

>> mean(m1)
ans = 
ErrorNum:
4.0 ± 0.2,  3.0 ± 0.2
>> sum(m1, 2) % 加算する次元を指定する引数も使えます。
ans = 
ErrorNum:
9.0 ± 0.4
5.0 ± 0.4

誤差を直接指定してインスタンスを生成する代わりに、データから計算させる方法もあります。(この方法をとると、標準偏差ではなく、標準誤差を計算します(標準偏差をsqrt(n)で割ったもの。))

% この生成方法では、誤差を計算する次元(この例では2)を第二引数に指定する必要があります。
>> data1 = [5.3, 5.2, 5.1, 5.8, 5.5]; ErrorNum.create(data1, 2)
ans = 
ErrorNum:
5.38 ± 0.12
>> data2 = ErrorNum.create(randn(30, 5) + 5, 1)
data2 = 
ErrorNum:
4.76 ± 0.15,  4.60 ± 0.18,  4.9 ± 0.2,  4.99 ± 0.17,  5.14 ± 0.19

次の記事ではErrorNumクラスのもう一歩進んだ使い方を解説します。

当ブログの Matlab プログラムの使い方

An English version is here.

当ブログに出てくるプログラムを使用するにはコードをダウンロードし、 Matlab パスに加えてください。手順は以下の通りです。

まずは Github からコードをダウンロードします。

$git clone https://github.com/shixnya/code-for-science.git

コマンドライン環境が整っていない場合は Github のページから zip ファイルをダウンロードすることもできます。

ダウンロードができたら(zipは展開してください)コードが格納されているディレクトリ(フォルダ)を Matlab パスに追加します(やり方はこちら)。

コードは更新される場合があります。gitを使っているならコマンドラインから “git pull” で簡単に更新できます。zipで落とす場合はその都度手動で更新する必要があります。