Posts
-
Scalability of data processing
How can we make distributed computing more resilient, remove bottlenecks, and improve scalability?
We can often address these questions at the architectural design level, in which we plan the structure of our system and the high-level interactions between system components.
-
Hash functions for cryptography versus look-up
A hash function accepts an arbitrary sequence of bits, such as a string or file, and outputs a corresponding sequence of bits of fixed size. This output is known as the "hash" of the input.
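As a rough illustration of the fixed-size property, and of the two uses named in the title, here is a minimal Go sketch. The helper names `cryptoHash` and `bucketIndex` are hypothetical, and SHA-256 and FNV-1a are simply assumed as representative cryptographic and look-up hashes; the post itself may discuss different functions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash/fnv"
)

// cryptoHash (hypothetical helper) returns the SHA-256 digest of data as hex.
// The digest is always 32 bytes (64 hex characters), whatever the input size.
func cryptoHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// bucketIndex (hypothetical helper) maps key to one of numBuckets slots using
// the fast, non-cryptographic FNV-1a hash, as a hash table might.
func bucketIndex(key string, numBuckets uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numBuckets
}

func main() {
	// Inputs of different lengths produce digests of the same length.
	fmt.Println(len(cryptoHash([]byte("a"))))
	fmt.Println(len(cryptoHash([]byte("a much longer input string"))))

	// The look-up hash only needs to spread keys across a few buckets.
	fmt.Println(bucketIndex("alice", 8) < 8)
}
```

The cryptographic hash trades speed for resistance to deliberate collisions, while the look-up hash only needs to be fast and reasonably uniform.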
-
Simple Git workflows for teams
There are several common workflows for managing projects using Git, and which one works best will depend on your team’s structure and the complexity of your project.
If you deploy frequently, you may benefit from a structure that maintains an always-stable branch to release from, while allowing developers to work on unstable features. If you don’t host your products on your own servers, but instead have customers download software to install themselves, you may have to maintain multiple release branches that you can apply hot-fixes to.
-
Types of failures in distributed systems
Failure recovery is an interesting problem in many applications, but especially in distributed systems, where there may be multiple devices participating and multiple points of failure.
It’s very educational to identify the distinct roles in a system, and ask for each one, “What would happen if that part of the system failed?”
-
Golang channels
The Go programming language has built-in communication channels, which provide type-safe one-way or two-way communication between goroutines. This can be very useful for concurrent programming, such as in master-slave and map-reduce programs.
Channels can be buffered or unbuffered. A buffered channel can store multiple messages up to a declared capacity; once it is full, further sends block until another goroutine receives from it.
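A minimal sketch of a buffered channel shared by two goroutines may make this concrete. The function names `produce` and `sum` and the capacity of 2 are assumptions chosen for illustration only.

```go
package main

import "fmt"

// produce sends 0..n-1 on out, then closes it.
// The type chan<- int is a send-only (one-way) view of the channel.
func produce(n int, out chan<- int) {
	for i := 0; i < n; i++ {
		out <- i // blocks while the buffer is full
	}
	close(out)
}

// sum receives values until the channel is closed.
// The type <-chan int is a receive-only (one-way) view of the channel.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int, 2) // buffered channel with capacity 2
	go produce(5, ch)       // the producer runs in its own goroutine
	fmt.Println(sum(ch))    // 0+1+2+3+4 = 10
}
```

With `make(chan int)` instead, the channel would be unbuffered and every send would wait for a matching receive.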
-
Setting Vim colour schemes
The default syntax highlighting scheme is so-so, especially when it comes to the dark blue comments against a black background.

Fortunately, we can easily install our own. I prefer a colour scheme called Monokai to the pre-installed set, and the steps to install it are the same as for any other scheme.
-
Supervised learning
Supervised learning algorithms use an initial set of labelled data to "learn" a prediction model that can be applied to future data. Labelled data is divided into training and test samples, which are ideally independent and representative of the distribution of actual data. The training phase must not be influenced by the test data in any way, so that we avoid overfitting our model to our data.
If there isn't enough labelled data for us to split it into training and test samples and still obtain satisfactory results, then cross-validation can be applied to evaluate our models while training on all our data. If we evaluate a small number of models using cross-validation and pick the one with the lowest error, we can produce a fairly accurate and unbiased prediction model.
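To show how cross-validation reuses every sample for both training and evaluation, here is a minimal sketch of k-fold index splitting. The names `kFoldIndices` and `trainTestSplit` are hypothetical, the model fitting step is left out, and a real pipeline would normally shuffle the data before splitting.

```go
package main

import "fmt"

// kFoldIndices (hypothetical helper) assigns sample indices 0..n-1
// to k folds in round-robin order.
func kFoldIndices(n, k int) [][]int {
	folds := make([][]int, k)
	for i := 0; i < n; i++ {
		folds[i%k] = append(folds[i%k], i)
	}
	return folds
}

// trainTestSplit (hypothetical helper) holds out one fold for evaluation
// and returns all remaining indices for training.
func trainTestSplit(folds [][]int, heldOut int) (train, test []int) {
	for f, idx := range folds {
		if f == heldOut {
			test = append(test, idx...)
		} else {
			train = append(train, idx...)
		}
	}
	return train, test
}

func main() {
	folds := kFoldIndices(10, 3)
	for f := range folds {
		train, test := trainTestSplit(folds, f)
		// A model would be fitted on train and scored on test here;
		// the k scores are then averaged to compare candidate models.
		fmt.Println(len(train), len(test))
	}
}
```

Each sample is held out exactly once, so every label contributes to the error estimate without the model ever being scored on data it was trained on.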
Supervised learning isn't a magic solution to every problem, and each algorithm makes some assumptions that must be true of your data for it to work well. However, it can be a powerful tool if you understand where and how to apply it.
-
Visualizing data in Matlab
Data visualizations are a useful way to condense a large amount of information, and represent it in a format that is easy to read and interpret.
Generating a few visualizations is often easy once you have the data, and can be a good way to explore a data set. You might stop once you find a useful format and put together a script or app to provide up-to-date views, or just gather enough context to decide what to look into further.
In this post I’ll show how to generate line plots, box plots, histograms and scatter plots for a simple data set in Matlab. This data was sourced from Google Flu Trends (updated October 2014 model for United States 2013 data)*.