Thursday, July 4, 2013

Bayesian Latent models: the shortest explanation ever

I found this on Wikipedia, and it is probably the best and shortest explanation I have ever come across of latent variable models, and especially of these Bayesian non-parametric models:

  • The Chinese Restaurant Process is often used to provide a prior distribution over assignments of objects to latent categories.
  • The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.
From the first definition, you clearly see how the prior over the space of categories is constructed. You have an infinite space containing every possible assignment of objects to categories, and you build a distribution over this space, which then serves as a prior for the variables describing your objects.
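
To make the first definition concrete, here is a minimal sketch of drawing one assignment of objects to categories from a Chinese Restaurant Process. The function name `sample_crp` and the parameter name `alpha` (the concentration parameter) are my own choices for illustration, not something taken from the Wikipedia text.

```python
import random


def sample_crp(n_objects, alpha, seed=None):
    """Draw one partition of n_objects from a Chinese Restaurant Process.

    alpha (> 0) is the concentration parameter: larger values make new
    categories more likely. Returns a list whose entry i is the index of
    the category (table) assigned to object i.
    """
    rng = random.Random(seed)
    assignments = []
    table_sizes = []  # table_sizes[k] = number of objects already in category k
    for _ in range(n_objects):
        # An existing category k is chosen with weight table_sizes[k],
        # a brand new category with weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)  # open a new category
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments


if __name__ == "__main__":
    print(sample_crp(10, alpha=1.0, seed=0))
```

The number of categories is never fixed in advance: it grows with the data, which is what "infinitely sized space" means in practice.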

I'll let you think about the second definition, but picture an infinite collection of labels, each of which you can attach or not to each object depending on whether it has the feature (and maybe my explanation is not as clear as the definition from Wikipedia).
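
For the second definition, a small sketch may help as well. It draws a binary object-by-feature matrix from an Indian buffet process using the usual "customers and dishes" construction. The names `sample_ibp` and `alpha`, and the small Poisson helper, are assumptions of mine for illustration only.

```python
import math
import random


def _poisson(rng, lam):
    """Sample from Poisson(lam) with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_ibp(n_objects, alpha, seed=None):
    """Draw a binary matrix Z from an Indian buffet process.

    Z[i][k] == 1 means object i has latent feature k. The number of
    columns is random and depends on the draw.
    """
    rng = random.Random(seed)
    feature_counts = []  # feature_counts[k] = objects that already have feature k
    rows = []
    for i in range(1, n_objects + 1):
        features = set()
        # Object i takes an existing feature k with probability count_k / i.
        for k, count in enumerate(feature_counts):
            if rng.random() < count / i:
                features.add(k)
        # It then creates Poisson(alpha / i) brand new features.
        for _ in range(_poisson(rng, alpha / i)):
            features.add(len(feature_counts))
            feature_counts.append(0)
        for k in features:
            feature_counts[k] += 1
        rows.append(features)
    n_features = len(feature_counts)
    return [[1 if k in row else 0 for k in range(n_features)] for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(5, alpha=2.0, seed=1):
        print(row)
```

Unlike the Chinese Restaurant Process, where each object gets exactly one category, here each object can switch on any number of features.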

Saturday, June 15, 2013

Google Summer of Code 2013

I'm mentoring 2 students for the Google Summer of Code 2013, together with my colleague Nasos, who is a regular contributor to the project.
Boost has 7 students this year, and at Boost.uBLAS we're happy to say we have 2 of the 7!

One will work on implementing the missing BLAS level 1, 2 and 3 functions and on introducing CPU-level parallelism, through both auto-vectorization and hand-written SSE instructions.
The other will work on bringing parallelism at the core level, but especially at the network level, to make Boost.uBLAS one of the few general-purpose linear algebra libraries that can distribute computations over a network using MPI.

Creating a BTRFS filesystem on 2 disks

I know it has nothing to do with Machine Learning, AI or even C++ coding, but think about it: it's also part of the job. You just received a massive dataset and you only have small hard disks. In my case I have 2 disks of 400 GB. I know that's small by today's standards, and I don't want to bother with a lot of partitions and a complex tree structure, especially because I need to store a massive dataset that requires more than... 400 GB.

With modern Linux there are at least 2 solutions:
  • LVM, the Logical Volume Manager
  • BTRFS, a new filesystem that offers incredible features

I have been using LVM for years, so I decided to give BTRFS a try. Here is my simple setup:

  1. I have a desktop computer to which I've just added 2 hard disks of 400 GB each
  2. I want to group them as if they were a single 800 GB disk
  3. I want RAID0 for speed. RAID0 stripes the data across the 2 hard disks, so reads and writes hit both at once, making the combined disk up to twice as fast as a single one (that is the theory; in practice it is not exactly twice as fast, but it is still much faster)
  4. I know RAID0 is not safe (if one disk fails, the data is lost), so I want at least the metadata to be RAID1, i.e. mirrored on both disks.
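
To illustrate the difference between the two RAID levels in that list, here is a toy model in Python. It is only a sketch of the idea of striping versus mirroring, not how BTRFS actually lays data out on disk; the function names and the tiny `chunk_size` are made up for the example.

```python
def stripe_raid0(data, n_disks=2, chunk_size=4):
    """RAID0 idea: cut data into chunks and deal them round-robin to the disks."""
    disks = [[] for _ in range(n_disks)]
    for index, start in enumerate(range(0, len(data), chunk_size)):
        disks[index % n_disks].append(data[start:start + chunk_size])
    return disks


def read_raid0(disks):
    """Rebuild the original data by interleaving the chunks of every disk."""
    chunks = []
    longest = max(len(disk) for disk in disks)
    for i in range(longest):
        for disk in disks:
            if i < len(disk):
                chunks.append(disk[i])
    return b"".join(chunks)


def mirror_raid1(data, n_disks=2):
    """RAID1 idea: every disk holds a full copy of the data."""
    return [bytes(data) for _ in range(n_disks)]


if __name__ == "__main__":
    payload = b"abcdefghij"
    striped = stripe_raid0(payload)
    print(striped)              # each disk holds only part of the data
    print(read_raid0(striped))  # both disks are needed to read it back
    print(mirror_raid1(payload))  # either disk alone is enough
```

With striping, each disk does only half the work but neither holds a complete copy; with mirroring, capacity is halved but one disk can fail. That is why I want RAID0 for the data and RAID1 for the metadata.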

BTRFS will do everything for you and it's surprisingly simple:

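  1. Let's say my 2 hard disks are /dev/sda and /dev/sdb. The names may differ in your case
  2. I create a filesystem with RAID0 for data and RAID1 for metadata, as said before (the -d and -m options set the two profiles explicitly instead of relying on the defaults)
    • mkfs.btrfs -d raid0 -m raid1 /dev/sda /dev/sdb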
  3. I check everything is fine with
    • btrfs filesystem show /dev/sda
    • note that if you replace sda by sdb, you will get the same information because the 2 disks are linked by btrfs
  4. I create a new directory where I want my disk to be mounted
    • mkdir /mnt/newdisc
  5. I mount it
    • mount /dev/sda /mnt/newdisc
    • or mount /dev/sdb /mnt/newdisc. Again, it's the same
  6. I add it to /etc/fstab for it to be mounted at boot time
    • /dev/sdb /mnt/newdisc btrfs defaults 0 1
  7. Et voilà!