On this page

Group by


This as name suggests, gpby (groupby), which is computed continuously and is always ready to be consumed.
Every processing with the db for stream is always updated for every single event and this enables us to deal with
continuously running run time queries.
This is super efficient, fast and clean way for continuous running analysis

Let’s look at each one now at a time;

{"gpat":["a", "b"], "iatr":"c", "stat":1, "gran":3600, "swsz":86400, "kysz":48}

This tells the db to groupby (a, b) for input attribute c, set the keysize as 48 (this is to specify max length of the gpby name) for granularity “gran”: 3600 with sliding window size of 1 day (86400 sec) and finally “stat” dictates what to do here,

for ex; “stat”:1 means count.

This allows db to keep computing gpby for c by (a, b) every hour. so in typical sql language sense, it’s select c from table1 groupby a, b

Note: gpat -> is groupby attributes in typical sql query. iatr -> attribute on which aggregation occurs stat -> what kind of aggregation we want? count(1), unique count(2), running stats(3) etc… Pls note, count happens for string and fixed type attribute, unique count for string and running stats for fixed type (long, double) gran -> what’s the granularity for the aggregation?

For ex; aggregated value for 300 secs will keep the aggregated value for 5 min and several 5 min data will be there swsz -> How long you would like these aggregated data to be there? 86400 sec would mean keep it for 1 day kysz -> each aggregated data is stored with composite string as key.

The composite string is formed as follows; nts:attr1:attr2 Here attr1 and attr2 the attributes that appear in gpat array i.e. groupby attributes. The order is followed as it’s defined in the gpat array. nts means nth time stamp tick since begining of epoch. It depends on gran value.
Therefore, if gran is 60 sec, the nts is nth minute since epoch

For ex; if we replace a with category id, b with page id and c with visitor id, then this gpby scheme tells “count num of visitors groupby category and page in one hour” etc.

{"gpat":["a"], "iatr":"d", "stat":2, "gran":300, "swsz":864000, "kysz":32}

This one is also similar except it tells do unique count (stat:2)

So for example, it could be unique visitor count every day etc.

Note: It’s is self evident but good to emphasise that all these computations are happening at run time for every single event.
This allows us to do run time continuous pattern identification (such as anomaly detection) etc. but also to return the data instantly when it comes to query as the post processing for data won’t happen and data is already in ready to be consumed state.
This is true for all other computations as well in BangDB


Was this article helpful to you? Yes No