Monday, September 29, 2014

Bash prompt with colors, and git branch

So, you're working on several machines at the same time, moving from directory to directory, and using git with multiple branches. And, like me, you can't remember where you are or which git branch you're on, because you have half a dozen or more terminals open on your screen.

Here is a little trick you can do with bash: change the prompt.
First of all, download this nice little script called git-prompt.sh and save it in your home directory as .git-prompt.sh.

Then, edit your .bashrc and add the line


source ~/.git-prompt.sh


wherever you want. It can be near the beginning, for example. If you don't have a .bashrc, look for .bash_profile instead.

Then search for the definition of the variable PS1 and replace its line with the following:


export PS1='\h:\[\e[33m\]\W\[\e[0m\]$(__git_ps1 " (%s)")\$ '

If you can't find it, just add the line wherever you want but in any case, after the source command from above.

The variable PS1 defines your bash prompt as follows:
  • \h : prints the hostname of the machine you are on
  • \[ and \] enclose ANSI escape sequences; here I use them to change the color
  • \e is the ESCape character
  • [33m is for yellow
  • [0m is for "back to normal color"
  • \W is the basename of your working directory; for example, /home/foobar/mydir will be shown as mydir only
  • $(__git_ps1 " (%s)") calls the __git_ps1 function, which prints your git branch, or nothing if you're not in a git-managed directory
  • \$ becomes $ if you're a normal user, or # if you're root
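Putting the two pieces together, a minimal .bashrc fragment could look like this. The guard around the source command is an addition of mine, so the prompt still works on machines where the script is missing:

```shell
# Only load the git prompt helper if it is actually installed
if [ -f ~/.git-prompt.sh ]; then
    source ~/.git-prompt.sh
    export PS1='\h:\[\e[33m\]\W\[\e[0m\]$(__git_ps1 " (%s)")\$ '
else
    # Fallback prompt without the git branch
    export PS1='\h:\[\e[33m\]\W\[\e[0m\]\$ '
fi
```
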


Wednesday, June 18, 2014

Boost Spirit and C++ 11

It's a little trick again, but worth knowing about. This time it's about Boost Spirit and C++11. Boost Spirit is a C++ library to generate a parser by writing C++ code directly, unlike other parser generators like yacc, bison or ANTLR, which have their own language. It's cool, heavily templated, produces fast code, and is fun to use!

If you want to compile the examples in the tutorial (like this one: http://www.boost.org/doc/libs/1_55_0/libs/spirit/example/qi/mini_xml3.cpp) with C++11, you will get a BIG error message: hundreds of undecipherable lines of template instantiation errors that are absolutely impossible to read unless you're a world expert, basically the author of the Spirit library.

The first time I tried, I ended up with over 100 KB of error messages! I tried different versions of my GCC compiler, and clang++ from the LLVM project, without any success.

After a lot of googling, I finally found a nice little post describing the simple solution to make it work. At the beginning of your code (the mini_xml3.cpp example, say), add these two lines before your #include statements:

#define BOOST_RESULT_OF_USE_DECLTYPE
#define BOOST_SPIRIT_USE_PHOENIX_V3


The first line forces the Boost libraries to use the C++11 decltype instead of Boost's result_of, and the second asks Phoenix (another Boost library used by Spirit) to use the latest version, V3, instead of the old V2, which is not C++11-friendly.

My compiler is gcc 4.8.2 and I use Boost 1.55. Your results might vary with different versions.

By the way, using Phoenix V3 is also needed if you intend to put lambda functions in the action part of your grammar rules (if you understand that sentence, it means you have already read the whole Boost Spirit tutorial). So this trick solves 2 problems.
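If you prefer not to touch the source file, the same two macros can be passed on the compiler command line. A sketch, assuming g++ and that the Boost headers are installed in a standard location (adjust the file names to your setup):

```shell
# -D flags are equivalent to the two #define lines above;
# they must take effect before any Boost header is included
g++ -std=c++11 \
    -DBOOST_RESULT_OF_USE_DECLTYPE \
    -DBOOST_SPIRIT_USE_PHOENIX_V3 \
    mini_xml3.cpp -o mini_xml3
```
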


Thursday, July 4, 2013

Bayesian Latent models: the shortest explanation ever

I found this on Wikipedia, and it is presumably the best and shortest explanation I have ever found of latent variable models, and especially of these Bayesian non-parametric models:

  • The Chinese Restaurant Process is often used to provide a prior distribution over assignments of objects to latent categories.
  • The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.
From the first definition, you clearly see the construction of your prior over the space of categories. You understand that you have an infinitely large space where all the possible combinations of categories exist, and you are building a distribution over this space, which will then be used as a prior for the variables describing your objects.

I'll let you think about the second definition, but picture an infinite collection of labels that you can put on each object or not, depending on whether it has the feature (and maybe my explanation is not as clear as the definition from Wikipedia).
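For the curious, the first definition can be written down explicitly. If \(z_1,\dots,z_n\) are the category assignments of the first \(n\) objects and \(n_k\) is the number of objects already in category \(k\), the Chinese Restaurant Process with concentration parameter \(\alpha\) assigns the next object as follows:

\[ P(z_{n+1}=k \mid z_1,\dots,z_n) = \frac{n_k}{n+\alpha}, \qquad P(z_{n+1}=\text{new category} \mid z_1,\dots,z_n) = \frac{\alpha}{n+\alpha} \]

So popular categories attract new objects (the "rich get richer" effect), while \(\alpha\) controls how often a brand new category is created.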

Saturday, June 15, 2013

Google Summer of Code 2013

I'm mentoring 2 students for the Google Summer of Code 2013, with my colleague Nasos who is a regular contributor to the project.
Boost has 7 students this year, and at Boost.uBLAS we're happy to say we have 2 students among the 7!

One will work on implementing missing functions in BLAS levels 1, 2 and 3, and on introducing CPU-level parallelism like auto-vectorization and hand-written SSE instructions.
The other will work on bringing parallelism at the core level, and especially at the network level, to make Boost.uBLAS one of the only general-purpose linear algebra libraries that can distribute computations over a network using MPI.

Creating a BTRFS filesystem on 2 disks

I know it has nothing to do with Machine Learning, AI or even C++ coding, but think about it: it's also part of the job. You just received a massive dataset and you only have small hard disks. In my case I have 2 disks of 400 GB each. I know it's small by today's standards, and I don't want to bother with a lot of partitions and a complex tree structure, especially because I need to store my massive dataset that requires more than... 400 GB.

With modern Linux there are at least 2 solutions:
  • LVM the Logical Volume Manager
  • BTRFS a new filesystem that offers incredible features

I have been using LVM for years, so I decided to give BTRFS a try. Here is my simple setup:

  1. I have a desktop computer to which I've just added 2 hard disks of 400 GB
  2. I want to group them as if they were a single 800 GB disk
  3. I want RAID0 for speed. RAID0 stripes the data across the 2 hard disks at the same time, making my new virtual disk twice as fast as a single hard disk (in theory; in practice it is not exactly twice as fast, but it is still way faster anyway)
  4. I know RAID0 is not safe all the time, and I want, at least, the metadata to be RAID1, i.e. mirrored on both disks.

BTRFS will do everything for you, and it's surprisingly simple:

  1. let's say my 2 hard disks are /dev/sda and /dev/sdb. It may vary in your case
  2. I create a filesystem with RAID0 for data and RAID1 for metadata, as said before
    • mkfs.btrfs /dev/sda /dev/sdb
  3. I check everything is fine with
    1. btrfs filesystem show /dev/sda
    2. note that if you replace sda with sdb, you will get the same information, because the 2 devices are linked by btrfs
  4. I create a new directory where I want my disk to be mounted
    • mkdir /mnt/newdisc
  5. I mount it
    • mount /dev/sda /mnt/newdisc
    • or mount /dev/sdb /mnt/newdisc; again, it's the same
  6. I add it to /etc/fstab so it is mounted at boot time
    • /dev/sdb /mnt/newdisc btrfs defaults 0 1
  7. Et voila!
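Once mounted, two commands are handy to double-check that the data is really striped and the metadata really mirrored (paths as in the steps above):

```shell
# List the devices belonging to the filesystem; both disks should appear
btrfs filesystem show /dev/sda

# Show space usage per profile; Data should report RAID0 and Metadata RAID1
btrfs filesystem df /mnt/newdisc
```
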

Tuesday, November 9, 2010

A graphical model of relationships between univariate distributions

Reading the excellent book Statistical Inference by George Casella and Roger L. Berger, I found a reference to an interesting 1986 work by Lawrence M. Leemis that's worth reading. In fact, this paper was updated in 2008 and presents a nice graphical model (though not really in the sense we all imagine now: it's just a graph) of univariate probability distributions and what it takes to go from one to another, like generalizations or specializations. The graph in this paper is really amazing!
Just for fun, the graph:

Monday, August 16, 2010

Linear Algebra in C++

What? Linear Algebra in C++, especially in Machine Learning, where almost everybody programs in Matlab? Yes, because most of the time C++ is the language used in industry. And not a bad one, to be honest (please, no language war here; I know many other languages are good, etc.). So, I'm the new manager of a Linear Algebra library called uBLAS. Yes, you're reading that right: the one in the famous Boost libraries.

Boost is made of many libraries focusing on template programming and higher-order programming. One of them, uBLAS, is devoted to Linear Algebra. I created a companion website. It's one of the fastest libraries; however, it lacks several things. We're working hard to improve that:
  1. the documentation was poor: I made a new one and am still improving it,
  2. no small fast SSE vectors, no GPU support: we're already working on this,
  3. I will change products from prod(m1, m2) to a nicer m1*m2 directly,
  4. a new impressive assignment operator: it's now easier and more powerful to fill in your matrices than in Matlab!
The new version 1.44 is out now, with a lot of improvements, and it's just the beginning of the story. I invite everyone interested in linear algebra in C++ to join the project.

Tuesday, June 29, 2010

log-sum-exp trick

When I implement models with discrete variables (which actually happens more often than one might think), I always end up estimating this value:
\[ V = \log \left( \sum_i e^{b_i} \right) \]

Why? This usually happens in the denominator of a Bayes formula, for example. I try to keep \(\log\)-probabilities all the time, so as not to have to deal with very small numbers, and to do additions instead of multiplications. By the way, I was looking at the timing and latency of floating-point instructions in the latest processors (like the Intel Core i7 for example), and I realized that, still in 2010, additions are faster than multiplications (even with SSEx and the like).

Therefore, use \(\log\)-probabilities.

In this expression, the \(b_i\) are the log-probabilities, and therefore the \(e^{b_i}\) are very small or very big, sometimes causing overflow or underflow. A scaling trick lets us work with numbers in a better range, without loss of accuracy and for a little extra cost, as follows:
\[ \begin{array}{rcl} \log \left( \sum_i e^{b_i} \right)&=& \log \left( \sum_i e^{b_i}e^{-B}e^{B} \right)\\ ~ &=& \log \left( \left( \sum_i e^{b_i - B }\right)e^{B} \right)\\ ~ &=& \log \left( \sum_{i} e^{b_i - B} \right) + B \end{array} \]

And that's it. For the value of \(B\), take for instance \(B=\max_i b_i\).
So the extra cost is to find the max value and to make a subtraction.
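A tiny numerical example shows why the trick matters. Take two log-probabilities \(b_1 = -1000\) and \(b_2 = -1001\): computed naively, \(e^{-1000}\) underflows to zero in double precision and the result is \(\log 0 = -\infty\). With \(B = \max_i b_i = -1000\):

\[ V = \log\left(e^{0} + e^{-1}\right) - 1000 \approx \log(1.3679) - 1000 \approx -999.687 \]

which is computed without any underflow at all.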

Monday, June 28, 2010

Just for those of you who want to know how to put formulas in Blogger, I used this link: http://watchmath.com/vlog/?p=438

Pretty straightforward. It uses a public LaTeX server to render the formulas. Very pretty!
This is my first post on this blog. And to be honest, this is the first time I'm gonna try to blog my thoughts. So, I'll do it on what I like these days: Artificial Intelligence and Machine Learning.

The idea is to post thoughts, tricks, ideas, etc., in the hope that people will read and comment too.

And, oh yes, I just installed a function to include math formulas. I don't know if it works, so let's try it now with a simple version of the Bayes formula:
\[ P(A|B) = \frac{P(B|A).P(A)}{P(B)}\]