Benchmarking GAIA MR on Google Cloud

I recently had a chance to benchmark GAIA in Google Cloud. The goal was to test how quickly I could process compressed text data (i.e., read and decompress it on the fly) when running on a single VM and reading directly from cloud storage. The results were quite surprising.

Gaia Mapreduce Tutorial - part1

Many Java-based mapreduce frameworks exist today: Apache Beam, Flink, and Apex, to name a few.

GAIA-MR is my attempt to show the advantages of C++ over other languages in this domain. It’s currently implemented for a single machine, but even with this restriction I’ve seen a 3-7x reduction in cost and running time compared with current alternatives.

Please note that the single-machine restriction puts a hard limit on how much data we can process; nevertheless, GAIA-MR shines with small-to-medium workloads (~1TB). This part gives a general introduction to mapreduce.

Introduction to fibers in C++

A short introduction to the fiber-based programming model, based on the Boost.Fiber library.

Seastar - Asynchronous C++ framework

Lately, there has been much discussion in the programming community in general, and in the C++ community in particular, about how to write efficient asynchronous code. Concepts such as futures, continuations, and coroutines are being discussed by the C++ standards committee, but little progress has been made beyond the very minimal futures support in C++11.

On the other hand, many mainstream programming languages have progressed faster and adopted asynchronous models, either in the core language or through popular standard libraries. For example, coroutines are used in Python (yield) and Lua. Continuations and futures are used extensively in Java. Golang and Erlang use green threads. Callback-based actor models are used in C and JavaScript. Because C++ lacks official support for asynchronous programming, the community has introduced ad hoc frameworks and libraries for writing asynchronous code in C++.

I would like to share my opinion on what I think will be the best direction for asynchronous models in C++ by reviewing two prominent frameworks: Seastar and Boost.Fiber. This (opinionated) post reviews Seastar.

Implementing cheap and precise clock

The POSIX API for querying high-precision hardware clocks is clock_gettime. If one-second precision is fine, then time(nullptr) is your friend. Unfortunately, precise clocks come at a price: they are more expensive CPU-wise.
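
As a rough illustration of the two calls mentioned above, here is a minimal sketch that assumes a POSIX system. The helper names PreciseNowNs and CoarseNowSec are hypothetical and chosen only for this example; the sketch shows how each clock is queried and does not measure their relative cost.

```cpp
#include <cstdint>
#include <ctime>
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Hypothetical helper: nanoseconds read from the POSIX monotonic clock.
// Returns -1 if clock_gettime reports an error.
inline std::int64_t PreciseNowNs() {
  timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
    return -1;
  }
  return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Hypothetical helper: wall-clock time with one-second precision.
inline std::time_t CoarseNowSec() {
  return std::time(nullptr);
}
```

Calling PreciseNowNs() before and after a piece of work and subtracting the two values gives an elapsed time in nanoseconds, while CoarseNowSec() is enough when whole seconds will do.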

Reloading data structures under high throughput

Suppose you have a multi-threaded server that serves tens of thousands of read queries per second. Those queries use a shared data structure or index that is mostly immutable while the server runs, with the exception of periodic index reloads. How do you implement data reloads in that server while keeping it alive and kicking in production?
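
One common way to approach this question, not necessarily the one the full post settles on, is to publish the index through a reference-counted pointer and swap that pointer on reload. The sketch below assumes the index is a simple string-to-int map; Index and IndexHolder are hypothetical names used only for illustration.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the real shared data structure.
using Index = std::unordered_map<std::string, int>;

class IndexHolder {
 public:
  explicit IndexHolder(std::shared_ptr<const Index> initial)
      : current_(std::move(initial)) {}

  // Readers take a snapshot. The lock is held only while the pointer is
  // copied, and the snapshot stays valid even if a reload happens later.
  std::shared_ptr<const Index> Get() const {
    std::lock_guard<std::mutex> lock(mu_);
    return current_;
  }

  // The reloader builds a fresh index off to the side and then swaps it in.
  // The old index is freed once the last reader drops its snapshot.
  void Reload(std::shared_ptr<const Index> fresh) {
    std::lock_guard<std::mutex> lock(mu_);
    current_.swap(fresh);
  }

 private:
  mutable std::mutex mu_;
  std::shared_ptr<const Index> current_;
};
```

A query handler would call Get() once at the start of a request and use that snapshot throughout, so a reload never changes the index in the middle of a query. The short critical section is still a point of contention under very high throughput, which is the kind of trade-off the question above is about.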

How to serialize integers into memory

Here is an analysis of a bug I recently stumbled upon. My initial reaction was that the problem was in the compiler (or that “These are wrong bees”). Consider the code below. We copy 64 integers into a properly allocated destination buffer and yet, when compiled with the -O3 switch, this code crashes with a segfault!
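
This summary does not state the cause of the crash, so the following is only a sketch of a generally safe way to write integers into a byte buffer, not a reconstruction of the post's code. It assumes 32-bit integers and native byte order; StoreU32, LoadU32 and StoreArray are hypothetical helper names. Going through std::memcpy avoids casting a byte pointer to an integer pointer, which is undefined behavior when the address is not suitably aligned.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: write one 32-bit value at dest, which may be
// unaligned. Compilers typically turn this memcpy into a plain store.
inline void StoreU32(std::uint8_t* dest, std::uint32_t value) {
  std::memcpy(dest, &value, sizeof(value));
}

// Hypothetical helper: read one 32-bit value back from a possibly
// unaligned address.
inline std::uint32_t LoadU32(const std::uint8_t* src) {
  std::uint32_t value;
  std::memcpy(&value, src, sizeof(value));
  return value;
}

// Hypothetical helper: serialize count integers back to back into dest.
// dest must have room for count * sizeof(std::uint32_t) bytes.
inline void StoreArray(std::uint8_t* dest, const std::uint32_t* src,
                       std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    StoreU32(dest + i * sizeof(std::uint32_t), src[i]);
  }
}
```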

My first post

Hi, it’s my first attempt at blogging. I’ve chosen Hugo with the Icarus theme for content generation. This blog is going to be published via my GitHub repository. Stay tuned!
