Architecture

DB Schema

The basic problem of real-time event monitoring/analysis is to store a large number of events and be able to access them efficiently for analysis. For that reason, a carefully designed, hand-optimized database schema is in order.

MySQL

Since the event_log table will store hundreds of thousands to millions of events, it is important that the table be designed for maximum storage efficiency as well as retrieval efficiency. When developing this I looked into using something like PostgreSQL, which, after some investigation, appears to offer better performance primarily through features such as parallel queries. On the other hand, I have found that deep knowledge of a system's configuration and behavior trumps any particular new solution. For this reason, I made the decision to initially use MySQL, as I have over 17 years of experience administering and developing against it.

For example, I know how MySQL stores rows in a table such as event_log: the event_data column, where the unstructured event payload lives, is stored separately from the main table data. So by omitting event_data from a SELECT statement, I am merely querying rows of 34 bytes of fixed-width integer data, which results in much quicker access times.
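To make that concrete, here is a minimal sketch of what such a table and a "narrow" query might look like. This is not the actual PicoEvent DDL; beyond event_log, event_data, event_type_id, and user_id, the column names and types are illustrative assumptions, and PyMySQL is assumed as the driver.

    import pymysql  # assumed driver; any DB-API client works the same way

    # Illustrative layout: fixed-width integer columns plus an off-page
    # LONGTEXT payload. Only the columns named in the text are from PicoEvent.
    DDL = """
    CREATE TABLE IF NOT EXISTS event_log (
        id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        event_type_id SMALLINT UNSIGNED NOT NULL,
        user_id       BIGINT UNSIGNED NOT NULL,
        created       TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        event_data    LONGTEXT,
        KEY idx_event_type (event_type_id),
        KEY idx_user (user_id)
    ) ENGINE=InnoDB
    """

    # A "narrow" query that never touches the off-page event_data column,
    # so MySQL only reads the small fixed-width row data.
    NARROW_QUERY = """
    SELECT id, event_type_id, user_id, created
    FROM event_log
    WHERE event_type_id = %s AND created BETWEEN %s AND %s
    """

    def fetch_event_ids(conn, event_type_id, start, end):
        with conn.cursor() as cur:
            cur.execute(NARROW_QUERY, (event_type_id, start, end))
            return [row[0] for row in cur.fetchall()]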

No constraints

I'm sure for certain applications constraints are absolutely necessary to ensure database integrity, and Django is particularly helpful at making these straightforward through its ORM. But the first rule of analyzing data is to discard 99% of it. I'm not sure how much benefit constraints would provide to this particular application, but they reduce flexibility and have some impact on performance. The idea behind the event_log is to provide access to unstructured data anyway.

I experimented a bit with MySQL's JSON column support, but found that using LONGTEXT was faster. If you want to be hardcore, you can easily modify the serialization to support protobuf or similar, and change the column type to LONGBLOB if you want to serialize to binary. To minimize dependencies as much as possible, only the json library that ships with Python 3.6 is used.
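As a sketch of how small that serialization boundary is, this is roughly what it looks like with only the standard-library json module. The function names are hypothetical, not PicoEvent's actual API; swapping in protobuf (and LONGBLOB) would only mean returning bytes instead of str.

    import json

    def serialize_event(payload: dict) -> str:
        # Stored in the LONGTEXT event_data column.
        return json.dumps(payload, separators=(",", ":"), default=str)

    def deserialize_event(raw: str) -> dict:
        return json.loads(raw)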

Write performance

There are a lot of older articles that describe MySQL as being slow on commits. I believe this dates from the time when MySQL typically ran on spinning disks on dedicated servers, and the database would (correctly) wait until the record had actually been flushed to disk before finishing the commit. Maybe it was also because, 10 years ago, most people used the MyISAM table engine, which didn't even support transactions.

I have a hard time believing this is still the case now that the average MySQL installation is deployed on a VPS backed by SSDs and InnoDB has been the default engine for years. Furthermore, MySQL is now owned by Oracle, the same company that owns ZFS. As the MySQL 5.6 manual suggests under High Availability, you can use ZFS to improve database performance by tuning the file system, in much the same way that some of PostgreSQL's more interesting features work at the RDBMS level.

In my runbook for PicoEvent, I have instructions for using ZFS on Ubuntu 18.04 LTS to store the MySQL DB. I believe the utility of ZFS is limited on virtual machines; for top performance, ZFS should be run on a dedicated machine with plenty of RAM, and, if you have enough demand for storage, in a hybrid configuration with an SSD (possibly NVMe) for the system, L2ARC, and ZIL.

Speeding up writes is far less straightforward than speeding up reads. The ZIL is a really easy way to do this; interestingly enough, the excellent ZFS Administration Manual (written in 2005) recommends using the ZIL to protect against power outages while still allowing asynchronous writes (the commit is logged to the SSD until it can be fully committed to disk).

Now that SSDs are really cheap, there may not be any reason to do this unless you need to store enough data that a hybrid storage solution is indicated.

Redis

To mitigate the relatively low parallel read performance of MySQL, the EventLog uses Redis as an LRU cache.
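A minimal read-through sketch of that cache, assuming the redis-py client and a hypothetical key scheme; the real EventLog wraps this differently.

    import json
    import redis  # redis-py client, assumed

    cache = redis.Redis(host="localhost", port=6379)

    def get_event(event_id, fetch_from_mysql):
        key = "event:{}".format(event_id)   # hypothetical key scheme
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        event = fetch_from_mysql(event_id)  # fall through to MySQL on a miss
        # With maxmemory-policy set to allkeys-lru, Redis evicts cold keys
        # automatically, so no explicit TTL is strictly required here.
        cache.set(key, json.dumps(event, default=str))
        return event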

Thinking about it now, I would imagine Redis could have excellent performance using a ZFS storage pool; I might run some benchmarks on a hybrid-storage FreeBSD machine to see if I can find some data to back that up.

System Architecture

This is how PicoEvent should be deployed in its primary configuration: providing BI analytics for a Django site.

All calls are made to the abstracted EventLog class. New events are first committed to MySQL and then serialized and placed in the Redis cache. Django user data is not stored in the PicoEvent MySQL instance; it is retrieved via django-rest-framework and stored in the Redis cache.
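In code, that write path looks roughly like the following. The method and attribute names are illustrative, not the actual EventLog interface, and the table layout is the hypothetical one sketched earlier.

    import json

    class EventLog:
        def __init__(self, db_conn, redis_client):
            self.db = db_conn          # MySQL master connection
            self.cache = redis_client  # Redis LRU cache

        def log_event(self, event_type_id, user_id, payload):
            # 1. Commit to MySQL first -- it is the source of truth.
            with self.db.cursor() as cur:
                cur.execute(
                    "INSERT INTO event_log (event_type_id, user_id, event_data) "
                    "VALUES (%s, %s, %s)",
                    (event_type_id, user_id, json.dumps(payload, default=str)),
                )
                event_id = cur.lastrowid
            self.db.commit()
            # 2. Then serialize and place the event in the Redis cache.
            self.cache.set("event:{}".format(event_id),
                           json.dumps(payload, default=str))
            return event_id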

Analytic data retrievals are made against a MySQL replica (slave) instance to keep unnecessary load off the master instance, which is necessarily responsible for all writes.
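One simple way to express that split, assuming two separate connections with hypothetical hostnames and credentials: writes go to the master, analytic SELECTs go to the replica.

    import pymysql  # assumed driver

    master = pymysql.connect(host="mysql-master", user="picoevent",
                             password="change-me", db="picoevent")
    replica = pymysql.connect(host="mysql-replica", user="picoevent_ro",
                              password="change-me", db="picoevent")

    def run_analytics_query(sql, params=()):
        # Analytic reads never touch the master.
        with replica.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()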

AWS Kinesis

Right now Kinesis support is limited to publishing event data to S3 for redundant storage. One could probably run some sick AWS Lambda serverless apps against this stream for some machine learning or whatnot.
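At its core the Kinesis hook is just a put_record call via boto3. This sketch assumes a stream name of your choosing and is not the exact code in PicoEvent; delivery to S3 would typically be handled by something like Kinesis Firehose on the AWS side.

    import json
    import boto3  # AWS SDK for Python

    kinesis = boto3.client("kinesis")

    def publish_event(event_id, payload, stream_name="picoevent-stream"):
        # The stream can then be archived to S3 or consumed by Lambda
        # functions for further processing.
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps({"event_id": event_id, "event_data": payload},
                            default=str).encode("utf-8"),
            PartitionKey=str(event_id),
        )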

Data analytics

The BI admin interface will generally make a database call for a certain group of events (or events by a particular user) over a certain time period.

The serialized event_data will be fetched one record at a time (as unstructured data can be quite large). This would appear to be the most efficient way to look up data; I can't imagine a more efficient query than a SELECT on an AUTO_INCREMENT primary key. Deletes are not allowed, as I believe the resulting gaps in the primary key would reduce the efficiency of the search.
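A sketch of that retrieval pattern, using the same hypothetical column names as above: first a narrow query for the matching ids, then one primary-key lookup per record for the large event_data payload.

    import json

    def fetch_events(conn, event_type_id, start, end):
        with conn.cursor() as cur:
            # Narrow query: only the primary keys of the matching events.
            cur.execute(
                "SELECT id FROM event_log "
                "WHERE event_type_id = %s AND created BETWEEN %s AND %s",
                (event_type_id, start, end),
            )
            event_ids = [row[0] for row in cur.fetchall()]
            # Fetch the (potentially large) payloads one at a time by PK.
            for event_id in event_ids:
                cur.execute("SELECT event_data FROM event_log WHERE id = %s",
                            (event_id,))
                (raw,) = cur.fetchone()
                yield event_id, json.loads(raw) if raw else None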

Modifying event_data is not allowed through the web interface. Of course, you can do anything you want directly to the database.

Using Flask over Django for business intelligence

(Figure: Django vs. Flask comparison)

The figure above makes it quite clear how much more complex Django is than Flask. Making this system efficient enough to run economically depends on careful configuration and direct access to the database. For example, event_type_id is one of the indexed columns on the event_log table. Since 65k distinct event types would seem to be an enormous number to keep track of, indexing a potentially huge number of rows against a 2-byte value saves a lot of RAM.

One could argue that the user_id field could be optimized by reducing it to an unsigned 32-bit integer. I might do that, but that's the good thing about open source software: it's very easy to hack this thing up for your own use case.
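If you did want to make that change, it is a one-line migration; the sketch below assumes the hypothetical column layout shown earlier, not the actual PicoEvent schema.

    def shrink_user_id_column(conn):
        # Hypothetical one-off migration: shrink user_id to a
        # 4-byte unsigned INT.
        with conn.cursor() as cur:
            cur.execute("ALTER TABLE event_log "
                        "MODIFY user_id INT UNSIGNED NOT NULL")
        conn.commit()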

I welcome suggestions and contributions; in particular, if somebody who is an expert in PostgreSQL could implement support for it, I would be extremely appreciative.
