## How to obtain sparse solutions

Consider the following problem :

$mimimize_w f(w)$   such that $||w||_0|| \leq c$

Where c is a user-specified and constant number. The function f(w) here maybe the empirical risk or log likelihood of the training samples. The constraint limits the scope of the parameters space in order to prevent overfitting. According to my best knowledge there are three main ways to obtain sparse solutions for the optimization problem (1)

1. Enforcing the parameter w to be nonegative
I think we have a basic fact in optimization which states that the optimal solutions lie on the boundary of the feasible region. We have two examples for this case:
A commonly typical example is SVM. In the dual form of SVM, the Langrange multipliers are nonnegative. Therefore the solutions tends to sparse. In this case the non-zero solutions are called support-vectors.
Another example that I meet in my graduation thesis is the Kullback-Leibler Importance Estimation Procedure (KLIEP) which appeared in []. In this work, authors represented the importance weights
$\beta(x)=\alpha_1 K(x,x^{te}_1 +... \alpha_b K(x,x^te_{b})$. Based on the data, we have to estimate the parameters $\alpha$ such that KL-divergence from the density function of test data $q(x)$ to the density function of training data multiplied with importance weights $\beta (x) p(x)$ is minimized. We will obtain a constraints convex optimization which can be solved by projected gradient ascent. The constraint $\alpha_i \geq 0$ ensure that the optimal solutions contain lot of zero coefficients.

In order to obtain sparse solutions we can add a term in the form $||w||_p$ in the original objective function with $0 \leq q \leq 1$, then we have:

$minimize_w f(w)+ C||w||_p$ (2)

Where C is the regularization parameter .

According to Bayesian statistics, if f(w) here is log-likehood, the optimization problem (2) is equivalent with maximum a posterior. We place Laplace distribution on the parameter w in the case of p=1, and a zero mean and an identify covariance matrix Gaussian distribution in the case of p=2. The paramaters in these distributions will lead to the appearance of parameter C, so C may be called as the hyperparameter. How to adopt  a possibly optimal paramter C will be discussed in the next post.

p=1 is a popular choice, but we should be care since the overall function is not differentiable at some $w_i =0$

Eg: L1-SVM, Lasso,..

## Current works for graduation thesis

Recently, I am spending almost my time  practising GRE and will take it on next March, 7.  However, I have tried to do a few works for graduation thesis because I don’t want to lose all this month for only English certificates. Currently, I am working on “Covariate shift problems in anomaly intrusion detection”. There has been already one work on this topic that is “Machine learning for intrusion detection, modelling the distribution shift”. We have some relevant terminologies here. The overall probelm called ” dataset shift”, but some articles may use different names such as ” concept shift” , “concept drift”. The term “dataset shift” first used on the book”Dataset shift in machine learning” of MIT press. The problem “dataset shift” can classified into smaller problems such as : selection bias, imbalanced/skew dataset and covariate shift. I will  give an introduction and explain some recent works to handle the “covariate-shift” phenonmenon as follows.

What is covariate-shift and does it  a really serious problem?

Different approaches to covariate shift

With the imbalanced dataset we have at least two approaches to solve this problem (one is reduce or add more data to balance the data, and another is to assign an importance or weight to each training data). However, according to my best knowledge we have only way to deal the “covariate shift” problem, that is reassign an importance in every instances of training set. Many different methods based on this idea. There are three conspicuous methods that are considerable : kernel mean matching and Kullback-Leibler importance  estimation precedure and kernel density estimation. The first and second method are indirect in the sense that we estimation the distribution of Pte(x)/Ptr(x) without calculating Pt(x) and Pte(x) seperately. In contrast, the third method is a direct approach that need to determine the distribtuon of Ptr(x) and Pte(x) then evaluate the quotient Pte(x)/Ptr(x).

## Các hướng nghiên cứu thú vị trong khoa học máy tính hiện nay

1. Tính toán lượng tử đoạn nhiệt (Adiabatic quantum computing) và thông tin lượng tử (quantum information)

Máy tính lượng tử tăng tốc độ tính toán của các thuật toán, ví dụ như thuật toán Grover sắp sếp danh sách n phần tử trong O(n^(1/2)). Đơn vị cơ bản của máy tính lượng tử là qbit(quantum bit). Có thể hiểu đơn giản thế này, với bit b bình thường thì b hoặc 0 hoặc 1, tức  nhưng với qbit thì ta có :  tức là ta có thể lưu trữ dữ liệu đa bit. Với các giới hạn vật lý, ta không thể có tình trạng là một điểm bất kì trong  .Điểm khác  biệt giữa  bit bình thường và quantum bit nằm ở 1 khái niệm trong vật lý là entanglement.

Máy tính vật lý đoạn nhiệt là 1 dạng cơ bản của máy tính lượng tử mà sử dụng đinh lý đoạn nhiệt để từ một  trạng thái không tác động ( non-interacting state) và xáo trộn nó thành một trạng thái tác động (interacting state) . Phương pháp sinh ra các bit lượng tử này là thực tế và có thể ứng dụng trong cuộc sống.

Còn về thông tin lượng tử, đã có nhiều thực nghiệm thú vị trong các lĩnh vực thị giác lượng tử và nanophotonics.

## Course projects in this semester

This semeter we are offered two course projects related to our major: computer networks that are network design and implementation an application of distributed system. I has been stuydied in communication and computer networks officialy but they are not my actual interest. That is a long story, however fortunately I put a lot of effort into researching machine learning, so I fulfill assigments by using machine learning techniques to build an application in these two fields. My guide suggest two solutions that are close to my background : anomaly detection using artificial intelligence for the first project and biometrics authentication for the second one. These ones are very deeply helpful in computer networks. Namely, anomaly detection can used to detect any abnormal attack and alert users if there any invasion into their host computers. In addition, biometrics authentication help to uniquely recognizing humans. I will go into detail for the latter one.

My course project for distributed system is “Non-linear manifold-based learning for face recognition”.

Fisrtly, we need an explicitly definition of what is a manifold and how it help to construct our problem. Informally speaking, manifold in differential geometry refers to a topological space that looks locally like Euclidean space. For each points in a manifold, there is a neighborhood that is topologically same as an unit ball in R^n. Readers will wonder how many neighborhood are needed in the above definition. In many learning algorithms you will see later the size of neighborhood will be often determined by empirical testing. There is always a trade-off between a smaller and bigger neighborhood we neeed to take in consideration when implementating manifold-based learning algorithms.Until now I have’nt been finish my assignments  due to procastination and some other reasons, but most importantly the perfectionism. I want to extend my three techniques ISOMAP, LLE, Kernel-PCA to handle  new unseen points. In machine learning, we call this out-of-sample case. Almost manifold-base learning algorithms operate in offline mode. If an unseen data point come to use, we must retrain all dataset plus this new one. It will be time-consuming, so an extension is indispensable.

There are two papers I am considering at present. The first one is ” A generalised extension of the out-of-sample problem in manifold learning” of Strange & Zwiggelar. This one is understandable and easily to implement in Matlab. The second one that is more arduous is “Spectral dimensionality reduction” of Begio et al. I have Matlab code of the technique in this paper, but to understand their approach is still a problem 😦

## A new combinatorial problem

I found this problem yesterday on the art of problem and solving site.  It has been a long time since I solved mathematical problems and it’s worthy to try to solve it.

Give a regular polygon with 2^n vertices. We color its edges by color or blue in some fashion. We implement the operation on each edge: change the color in this edge to red if the two neighbours  are in red color, or change it to blue if the two neighbours are of opposite.

Prove after 2^n-1 steps all edges of this polygon will turn to red color, and this does not hold with fewer steps

## Vật lý và tài chính

What have gases, random walks and Einstein got to do with banks and trading floors? Everything it seems !

Chúng ta có thể ngạc nhiên khi đầu tiên nghe thấy rằng Vật lý và tiền bạc có những điểm chung. Các trung tâm tài chính nổi tiếng trên thế giới như phố Wall thường có xu hướng tuyển những người làm các lĩnh vực khoa học khác nhau và Vật lý là phần nhiều ? Nhưng thực sự là như thế, quản lý tiền cũng như 1 thí nghiệm vật lý , phải xử lý những con số, và những đại lượng thay đổi. Và sự liên quan ở đây sâu hơn như thế.

Vật lý là về phát triển những sự mô tả chung- các mô hình toán học của thế giới bao quanh chúng ta. Các mô hình ở đây có thể mô tả các thể loại khác nhau của độ phức tạp như chuyển động của các nguyên tử trong khí hay động lực của các ngôi sao trong ngân hà. Người ta cũng nhận ra rằng các mô hình tương tự có thể áp dụng cho các hoạt động phức tạp trong thị trường tài chính. Thực sự hiện nay các ngân hàng

## Collection of quotes about entropy

“Nothing in life is certain except deathtaxes and thesecond law of thermodynamics. All three are processes in which useful or accessible forms of some quantity, such as energy or money, are transformed into useless, inaccessible forms of the same quantity. That is not to say that these three processes don’t have fringe benefits: taxes pay for roads and schools; the second law of thermodynamics drives cars, computers and metabolism; and death, at the very least, opens up tenured faculty positions”
Seth Lloyd, writing in Nature 430, 971 (26 August 2004).

“You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”
John von Neumann, in a response to Claude Shannon

“Every mathematician knows it is impossible to understand an elementary course in thermodynamics.”
V Arnold

“Thermodynamical entropy is no more objective than infomation entropy. Both are as subjective as my opinion on that issue.”
Pierre Dangauthier

“Glib, unqualified statements to the effect that “entropy measures randomness” are in my opinion totally meaningless, and present a serious barrier to any real understanding of these problems.”
Edwin T Jaynes

“Entropy is an anthropomorphic concept.”
Eugene Wigner

It follows from this that the idea of dissipation of energy depends on the extent of our knowledge. Available energy is energy which we can direct into any desired channel. Dissipated energy is energy which we cannot lay hold of and direct at pleasure, such as the energy of the confused agitation of molecules which we call heat. Now, confusion, like the correlative term order, is not a property of material things in themselves, but only in relation to the mind which perceives them. A memorandum-book does not, provided it is neatly written, appear confused to an illiterate person, or to the owner who understands it thoroughly, but to any other person able to read it appears to be inextricably confused. Similarly the notion of dissipated energy could not occur to a being who could not turn any of the energies of nature to his own account, or to one who could trace the motion of every molecule and seize it at the right moment. It is only to a being in the intermediate stage, who can lay hold of some forms of energy while others elude his grasp, that energy appears to be passing inevitably from the available tothe dissipated state.”
James Clerk Maxwell, “Diffusion”, Encyclopaedia Brittanica, 1878

## ICML discussion site

The International Conference on Machine Learning (ICML) of this year is happening now in Haifa, Israel.

You can joint to discuss and understand more about papers published in the conference at :http://mldiscuss.appspot.com/discuss/207

## Random permutations for randomized algorithms

In many applications such as implementation of  randomized/stochastics algorithms , you have to deal with generating random permutations.

Problem : Generate a uniformly random permuation a[i] of the sequence 1,2,….n.

A trivial effort we may easily to see  is using a brute force algorithm that for each i from 1 to n , we set a[i] to a number from  the set {1,2..,n} thus check whether the  terminal sequence we get is a permutation The probability we get a random permuation is $\frac{n!}{n^{n}}$

The second approach is more simple and required only O(n) steps is described as follow : We define a_{0} is the identity permutation(i.e a_k(i)=i) and a_k is constructed inductively as below: Let  b_{k}  is a random number drawn  uniformly from {k,k+1,..n} . a_{k}  is created by swaping the value at the k th postion and latex \b_k the position in the  a_{k-1}  permutation.

Pseudocode :

void randomPermutation(int a[], int n)

{ int temp;

for(i=0; i<n; i++)

{ int b= rand(i;n) ;

temp=a[i];

a[i]=a[b];

a[b]=a[i];

}

}

The permuation tree below is drawn to explain the operation of this algorithm applied for a permuation of four numbers. Every leaf is an permuation, so in case of n =4 there are totally 24(=4!) leaves in that tree.

The number 1 is described by the root node . So at the root we have four chances to go. That is equivalently to say that we have four opprtunities to swap for 1 to one in four number 1,2,3,4.

I

To this time we may  be curious why we can not choose the number b_k from {1,2,..n} .The reason is vey simple that finally we will get a  tree with n^n leaves and since n! is not divided by n^n, some permuations will have more chances to be selected than the others. That ‘s in contrast to the assumption that every permuation have equal probablities ( i.e uniform distribution)