Building an analytics system utilizing Domain-Driven Design (DDD) and Ruby on Rails

As I close in on building the MVP for Meettrics, I wanted to add the first basic meeting analytics and decided it would be a good time to do an exercise around Domain-Driven Design. This isn't an article about what DDD is. For that, you can check out this post by Martin Fowler and then buy the book. Instead, I want to look at a tangible example of an implementation.

Bounded contexts (source: Martin Fowler)

The part of this diagram that I wanted to explore was having duplicated "entities" in different contexts. In the Rails world, a User is a single model that gets passed around and used everywhere. This often leads to the god objects that DDD is intended to help reduce.

Initial Schema Design

Meettrics is a calendar scheduling application in the vein of Calendly. It lets people schedule events through a much nicer looking portal and then syncs to your calendars. The original reason for calling it Meettrics is that I am looking to provide smart business insights using calendar data. If your sales team slows down on booking meetings, there is probably a problem to look into. If your dev team is scheduling way more meetings than usual, something probably needs to be checked out. Can you identify weak communication links in your company by seeing who meets with whom? But since this is just the first example metric, we will put in place the metric for calculating total meeting time.

Given that the analytics section will eventually become a very core feature of Meettrics, it was important to take the time and get the schema correct. There could be a large number of rows per user. If I want to store data points down to 15 minute intervals, it could create up to 45,000 data points per metric, per year, per user. However, I would expect most users to generate around 1/10th of that, so let's round up to 5k data points per year, per metric.
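For the curious, here is my back-of-the-napkin arithmetic behind that ~45,000 figure. It assumes a row is stored at every level of the period hierarchy described later in this post, which is my reading of the design rather than something spelled out explicitly:

# Rough arithmetic behind ~45,000 data points per metric, per year, per user
quarter_hours = 4 * 24 * 365 # => 35_040
hours         = 24 * 365     # => 8_760
days          = 365
months        = 12
years         = 1

quarter_hours + hours + days + months + years # => 44_178, roughly 45k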

Analytics Tables

The first thing I was debating was how I wanted to store the metrics. There were two main strategies I was considering.

  1. Store all the metrics in one table with a metric_type style column. This would be nice for selecting out all of a user's data points in one query.
  2. Store the metrics in separate tables. This would be advantageous for querying the data and scalability.

Since the tables are going to grow very large and contain millions of rows, I decided to go the second route. I am not convinced that the scale is there to fully necessitate it. I estimate that at a sizable user base, I will be inserting 5-10 million records a year per metric. Given a dozen metrics, that could be 60 to 120 million records a year. This sounds like a sizable amount, and it's starting to get to the point where things get interesting, but it's still very manageable with basic tech. However, duplication is always cheaper than the wrong abstraction, so I will go the multiple table route at first.

Optimizing schema size

Once the schema was created, I could calculate the approximate size. On my analytics_total_meeting_times table, I have 3 bigints, a timestamp and an integer. Respectively, these are 8 bytes, 8 bytes and 4 bytes. Rails has a habit of sneaking those bigint columns into the migrations. A normal integer can go up to a little over two billion. It's probably safe to say that my app will not ever have two billion users. Note, my "user" here could be multiple things (a user, a team, a company), thus creating more analytics users than login users. However, it's still not going to hit two billion analytics users easily without some weird mistakes going on or me becoming filthy rich.

We can sum that up to get 36 bytes per row, not including index space. That would put each user as adding 1.62MB of data per year per metric at the max usage. At the more fair estimate of 5k rows per year, we would weigh in at around 180KB per year per metric.

For fun, let's see what using integers would be like. We could convert the two bigints to normal ints and halve their storage space. This would leave 1 bigint, 3 ints and 1 timestamp, for a total of 28 bytes per row. At max usage this would result in ~1.26MB of data per year per metric, or 140KB at the more typical 5k-row level.
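For reference, the integer-based version of the table could be created with a migration along these lines. This is a sketch: the column names are inferred from the model code later in the post, not copied from the real schema.

class CreateAnalyticsTotalMeetingTimes < ActiveRecord::Migration[6.1]
  def change
    create_table :analytics_total_meeting_times do |t|
      # references default to 8-byte bigint; type: :integer drops them to 4 bytes
      t.references :analytics_user, type: :integer, null: false
      t.references :analytics_period, type: :integer, null: false
      t.integer :total, null: false          # minutes spent in meetings
      t.datetime :calculated_on, null: false # when the snapshot was taken
    end
  end
end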

If we take the median estimate of rows inserted per year (90 million), those 8 bytes of savings per row add up to about 720MB per year. This results in a total size of 2.52GB vs 3.24GB of data added per year (excluding index size).

The biggest advantage of compact data is not the reduction in database size. Instead, it's query times. Less data on disk to scan means less query time.

As a quick side note, let's look at an even better normalization example: HTTP request logs. Setting aside the merits of storing requests in the database, check out the following example.

7.7GB per 19.5 million rows

As you see, it's currently sitting at 7.7GB per 19.5 million rows. This comes out to about 2.53MM rows per GB. This is a real example that is actually quite good based on what I've seen in production instances. Some reasons it does better than normal are not tracking much beyond the URL and keeping the schema quite narrow. Common Rails gems like Ahoy do a good bit worse than 2.5MM rows per GB with their denormalized schemas. By contrast, a normalized schema recording the same data in an apples to apples comparison could hit 8 million rows per GB (back of the napkin calculations).

Ichnaea, the event tracking in the Olympus Framework, is currently at 4.5MM rows per GB, but it's not quite apples to apples since Ichnaea tracks more, like UTMs, and is oriented at marketing. With a bit more tuning, I hope to get that up to 6 or 8MM rows per GB.

Ichnaea normalized schema at 4.5MM rows per GB
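To make the normalization idea concrete, a hypothetical request log migration in that spirit would pull the repeated strings into a lookup table and keep only compact integer keys on the high-volume table. The table and column names here are mine, not from any of the apps mentioned above.

class CreateNormalizedRequestLogs < ActiveRecord::Migration[6.1]
  def change
    # small lookup table holding each distinct URL once
    create_table :request_urls do |t|
      t.string :path, null: false, index: { unique: true }
    end

    # the high-volume table stores only narrow integer and timestamp columns
    create_table :request_logs do |t|
      t.references :request_url, type: :integer, null: false
      t.integer :status, limit: 2, null: false # 2-byte smallint
      t.datetime :requested_at, null: false
    end
  end
end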

Normalized time periods table

I have a table to store the various time periods that can be referenced. I wanted to use a bigint to store the period values because they can be up to 12 digits. However, I wanted to keep the reference down to 4 bytes on the metrics tables since those will have millions of rows. The values for the period column would be as below.

  • 2021 Year
  • 202105 Month
  • 20210523 Day
  • 2021052314 Hour
  • 202105231415 Minute (15 minute intervals).

This helps me ensure I can easily query any interval I want with integer ranges. If I want to get data for May of 2021 I could do that in a few ways:

  • Select 2021 to 20210531. This would give me year, month and day level metrics for that user
  • Select 202105 to 20210531. This would give me month and day level metrics.
  • Select 20210501 to 20210531. This would give me just day level metrics for May.

Since 12 digits is going to require a bigint, it made sense to put these in their own table and use the 4 byte int column on the various metrics tables to reference the period.
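As an illustration, deriving the five period keys for a given time could look something like this. This is a sketch of mine, not code from Meettrics:

require "time"

def period_keys(time)
  quarter = (time.min / 15) * 15 # 0, 15, 30 or 45

  [
    time.strftime("%Y").to_i,                      # 2021
    time.strftime("%Y%m").to_i,                    # 202105
    time.strftime("%Y%m%d").to_i,                  # 20210523
    time.strftime("%Y%m%d%H").to_i,                # 2021052314
    time.strftime("%Y%m%d%H").to_i * 100 + quarter # 202105231415
  ]
end

period_keys(Time.parse("2021-05-23 14:17"))
# => [2021, 202105, 20210523, 2021052314, 202105231415]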

Schema conclusion

These choices may seem small and perhaps even inconsequential in this day and age. Compute resources are cheaper than ever. Why even bother with calculating database size and normalization?

The schema is the foundation of the application. It is the core that is worth spending time to get as "correct" as possible. What counts as correct, as it always does, is going to change based on your use case. In data warehousing, saving space can be extremely beneficial. In analytics processing, denormalized data shines.

I prefer more tables over fewer. Often I find that the reason god objects start to come about is wide tables. Wide tables force you into putting more and more logic on their associated models. Over time, these models become heavily intertwined and really difficult to separate without cleaving off large areas of responsibility.

In short, a good schema helps create well written code and good application performance.

Ruby module structure

After deciding on my schema, it's time to jump into Ruby code and make this work. This is my first pass. After a few different metrics are added, I'll do another design review to ensure that my class structures are working out as I want them to. I'm also not super worried about optimization here. In particular, N+1 queries are not a major concern right now. Those are easy enough to fix later if scale is a problem. However, since this work is going to be backgrounded and easy to run in parallel, over-optimizing early won't provide a ton of benefit. If it was in the request/response cycle, then it would be absolutely worth looking into.

  • Analytics - top level module
  • Analytics::Calculators - tabulate the metrics
  • Analytics::Updaters - update any database record

We will check out all the files currently in this domain to get an idea of how it works out with the first metric.

Polymorphic entry points to the domain

If we look at our analytics users table, we can see that it is polymorphic. I am finding this a great pattern to simplify the domain logic. In my analytics domain, a few different application objects can become AnalyticsUsers. I can track metrics at a User, Team or Company level. Each of those objects gets its own Analytics::User. This is great because from that point on, everything in my Analytics domain is an Analytics::User and I can remove all the conditional code complexity of types in my Analytics module.
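The model behind that table is roughly the following. This is a minimal sketch of mine; the article only shows the table itself:

module Analytics
  class User < ApplicationRecord
    self.table_name = "analytics_users"

    # userable can point at a Profile, Team or Company record
    belongs_to :userable, polymorphic: true
  end
end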

To take this to the next level, I introduced an STI component for the Analytics::User.

STI of Analytics Domain User Types

I set up subclasses of Analytics::User for my different types of analytics users. The reason for this is to have fewer conditionals in my models. If I stuck with one Analytics::User class, then I would be doing conditionals in each method for the various types. In the screenshot above, I need to select calendar events for the various analytics users. STI provides a nice way to clean that all up without any conditional code.
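A sketch of what those subclasses might look like; the method bodies are my assumption of how each type would select its calendar events, based on the columns shown on the domain model below:

module Analytics
  class ProfileUser < User
    def calendar_events
      ::CalendarEvent.where(profile_id: userable_id)
    end
  end

  class TeamUser < User
    def calendar_events
      ::CalendarEvent.where(team_id: userable_id)
    end
  end

  class CompanyUser < User
    def calendar_events
      ::CalendarEvent.where(company_id: userable_id)
    end
  end
end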

At this point, I am unsure of how I will like this long term because the polymorphic association and the type column of STI are a bit redundant. I could settle fully on STI and get the same benefits, perhaps with one mechanism. For now, it won't affect my scope so I'm going to keep both in.

The other bit that I decided on is that each domain can touch the underlying persistence models.

The key is that these models and classes are available to any domain because they are just interacting with the database layer. Since these should not contain the business logic and deal only with persistence, this to me does not violate the principles behind DDD, which are more about where the logic resides.

Duplicated Domain Model

Because my main thing to analyze is CalendarEvents, and that is a table that stores the event times, I need a duplicate CalendarEvent object under my Analytics domain.

Let's see the code:

module Analytics
  class CalendarEvent
    include ActiveModel::Model

    attr_accessor :start_time, :end_time, 
                  :team_id, :company_id, :profile_id

    def time_range
      (start_time..end_time)
    end

    def users
      [profile_user, company_user, team_user].compact
    end

    private

    def profile_user
      Analytics::ProfileUser.find_by(
        userable_id: profile_id,
        userable_type: "Profile"
      )
    end

    def team_user
      Analytics::TeamUser.find_by(
        userable_id: team_id,
        userable_type: "Team"
      )
    end

    def company_user
      Analytics::CompanyUser.find_by(
        userable_id: company_id,
        userable_type: "Company"
      )
    end
  end
end

The 3 user queries of course could be optimized. As mentioned this is the first pass so I am waiting to see if the overall methods and classes work together before going to the optimization phase. No need to optimize if it doesn't work at all!
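For context, here's how one of these objects might be built from the persisted record. The article doesn't show the construction code, so the mapping below is an assumption based on the attr_accessors above:

# calendar_event_id is the id of whatever persisted event changed
event = ::CalendarEvent.find(calendar_event_id)

analytics_event = Analytics::CalendarEvent.new(
  start_time: event.start_time,
  end_time:   event.end_time,
  profile_id: event.profile_id,
  team_id:    event.team_id,
  company_id: event.company_id
)

# hand the domain object to the updater shown later in the post
Analytics::Updaters::TotalMeetingTime.for(analytics_event)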

This class is a "null object" / "data object" / "not persisted data holder object". I find when you go back to the books there is a varying array of names that could be applied to it. The important bit is just to find a term that people understand generally.

The methods that live on a class (particularly a model class) should be relevant to all domains. That question becomes a good litmus test for whether I should put a piece of logic in a service object or add it to the model (it's almost always a service object).

Helper objects

One of the best ways to level up your programming is to write helper objects to interact with the underlying database models. I am using the term "helper object" on purpose, as much of the Rails community is quite familiar with "service objects". The reality is, it's not very important what it's called. I do find myself a bit dissatisfied with the current state of the blogosphere on service objects, thus the purposeful distinction.

Service objects are more than classes with one method named call (or apply or create or whatever else). While that's a great start, the next step is expanding service objects to better encapsulate the "Single Responsibility Principle".

Here is an example from Meettrics that demonstrates this better:

module Profiles
  class Finder
    def for_email(email, options={})
    end

    def by_id(id)
    end

    def pending_for_company
    end

    def for_company_and_id(id)
    end

    def for_company
    end

    def for_company_paginated(page: 1, offset: 20)
    end

    def for_session
    end
  end
end

This class is responsible for finding profiles in the system. Because of SRP, this class becomes a chokepoint. All finding of profiles needs to go through here and it consolidates all the logic from across the application. If I add a soft delete feature for instance, this is the only class I need to update.
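To illustrate that chokepoint benefit, a hypothetical soft delete only has to touch a single base scope inside the finder. The scope and column name below are made up for the example:

module Profiles
  class Finder
    def for_email(email, options = {})
      base_scope.find_by(email: email)
    end

    private

    # every public finder builds on this relation, so soft delete is one change
    def base_scope
      ::Profile.where(deleted_at: nil)
    end
  end
end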

Anyways, back to the analytics example. I put together some helper objects to help with finding all the time periods applicable to a time range.

module Analytics
  class PeriodsFromTimeRange
    def self.for(time_range)
      new(time_range).lookup_periods
    end

    def initialize(time_range)
      @time_range = time_range
    end

    def lookup_periods
      start_time = time_range.first
      end_time = time_range.last
      periods = []

      while start_time < end_time
        periods = periods + Analytics::PeriodsFromTime.for(start_time)

        start_time = start_time + 15.minutes
      end

      periods.flatten.uniq
    end

    private

    attr_reader :time_range
  end
end

module Analytics
  class PeriodsFromTime
    def self.for(date_time)
      new(date_time).lookup_periods
    end

    def initialize(date_time)
      @date_time = date_time

      raise StandardError.new("Invalid start period time") if minutes_invalid?
    end

    def lookup_periods
      previously_created = ::Analytics::Period.where(period: periods)

      if previously_created.length < 5
        return backfill_periods
      else
        return previously_created
      end
    end

    private
    
    # truncated below this point (implementation details)
  end
end

This is of course not optimized, but it shows that I have two classes to do this work. I can either get all the periods for a spot time, or for a range. This matters because, at the smallest granularity, all of the 15 minute intervals a meeting spans need to be collected. Thus I can't ask for the periods at just the start time of the meeting; I need to get them every 15 minutes.
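Usage ends up looking something like this. It's illustrative only; since PeriodsFromTime is truncated above, I'm assuming it returns one Analytics::Period record per level of granularity:

# A one hour meeting from 14:00 to 15:00 on 2021-05-23
range = Time.zone.parse("2021-05-23 14:00")..Time.zone.parse("2021-05-23 15:00")

Analytics::PeriodsFromTimeRange.for(range)
# Assuming one period per level, this returns the records for:
#   2021, 202105, 20210523, 2021052314,
#   202105231400, 202105231415, 202105231430, 202105231445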

Even a not so astute reader would note that this is horribly unoptimized, as I need one query for every 15 minutes of meeting length. That is okay for now though. In this case, I realized my "bug" after putting together the PeriodsFromTime class. For the first pass, I decided to reuse it for time ranges since both ways could be advantageous. In the future these classes will be a good candidate to consolidate.

Updater class

I put together an updater class. The responsibility of this class is to update / insert the proper records for the metric.

module Analytics::Updaters
  class TotalMeetingTime

    def self.for(analytics_calendar_event)
      new(analytics_calendar_event).update
    end

    def initialize(analytics_calendar_event)
      @analytics_calendar_event = analytics_calendar_event
    end

    def update
      analytics_calendar_event.users.each do |user|
        time_periods.each do |period|
          Analytics::TotalMeetingTime.create(
            calculated_on: DateTime.now,
            analytics_user: user,
            analytics_period: period,
            total: meeting_time_for(period, user)
          )
        end
      end
    end

    private

    attr_accessor :analytics_calendar_event

    def meeting_time_for(period, user)
      Analytics::Calculators::TotalMeetingTime.for(
        analytics_period: period,
        analytics_user: user
      )
    end

    def time_periods
      @_time_periods ||= ::Analytics::PeriodsFromTimeRange.for(
        analytics_calendar_event.time_range
      )
    end
  end
end

Yet another questionable decision was calling these classes "updaters". The reality is that they won't "update" anything; they will only create. The reason for this is that I want a data model that stores an OLAP cube. This will allow me to analyze how meeting density changes over time.
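Since rows are only ever created, reading the current value means grabbing the latest snapshot. A hypothetical query (not from the article) would look like:

Analytics::TotalMeetingTime
  .where(analytics_user: user, analytics_period: period)
  .order(calculated_on: :desc)
  .first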

In the update method I leveraged my domain models. My Analytics::CalendarEvent object consolidates all of the users and information needed for the event. The Analytics::PeriodsFromTimeRange gets the time range from the domain model easily.

Calculating the statistic

The last bit to look at is the stat calculator. This is the easiest metric overall, and the implementation is pretty clean. The stand-in Analytics::User object once again shines as a way to collect all the calendar_events.

From there it's just a simple enumerable sum with some minor tweaks. I make sure to adjust the start and end times to fit within the period window, since it's possible for a meeting to cover only part of a period.

module Analytics::Calculators
  class TotalMeetingTime
    def self.for(analytics_user:, analytics_period:)
      new(analytics_user, analytics_period).calculate
    end

    def initialize(analytics_user, period)
      @analytics_user = analytics_user
      @period = period
    end

    def calculate
      calendar_events.sum do |calendar_event|
        start_time = adjusted_start_time(calendar_event)
        end_time = adjusted_end_time(calendar_event)

        (end_time.to_i - start_time.to_i) / 60
      end
    end

    private

    attr_reader :analytics_user, :period

    def adjusted_start_time(calendar_event)
      [period.as_range.first, calendar_event.start_time].sort.last
    end

    def adjusted_end_time(calendar_event)
      [period.as_range.last, calendar_event.end_time].sort.first
    end

    def calendar_events
      analytics_user.calendar_events.where(start_time: period.as_range)
    end
  end
end

Takeaways & Conclusion

I am pretty thrilled with the result. The amount of complexity handled is quite astounding when compared to how simple the corresponding code is. Given an event, this code will figure out:

  • Which user needs to be updated
  • Which team needs to be updated
  • Which company needs to be updated
  • All the corresponding metric periods it needs to insert
  • How to adjust dates to be inclusive of the metric period
  • How to sum and add a record for 15 different data points
  • All without any long if / case statements

The current code is query heavy. For the first pass that is okay. As I mentioned these are backgrounded. The queries are all simple selects and inserts. Even if there are 50-100 queries, it should still run in 1-2 seconds per event change or addition. Optimization is easy as well if needed. There are a few ways that could go and frankly I need production data to figure out which way to go instead of guessing.

Going with a Domain-Driven Design paradigm here and duplicating some models as needed provided a significant advantage in flexibility of the data model for the analytics domain, resulting in a great simplification and ultimately less code. Stay tuned for further updates on how maintainability works out and what lessons I continue to learn in this domain over the next few months.
