How to build a technical design architecture for an analytics data pipeline

An overview of designing & building a technical architecture for an analytics problem.

Sneha Mehrin
4 min read · Aug 15, 2020

This article is a continuation of the previous post and will outline how to transform our user requirements into a technical design and architecture.

Our two major requirements were summarised in the previous post. Let’s target the analytics problem first!

The discovery phase is usually the hardest, because you have to engage multiple stakeholders and the tech team to build the right solution.

The only way to get through this is to ask Questions!!!


Key questions to ask during the design phase

1. What are the key KPI metrics?

  • Number of questions per day (should be able to visualise by month).
  • Number of answers per day (should be able to visualise by month).
  • Number of accepted answers per day (should be able to visualise by month).
  • Number of unaccepted answers per day (should be able to visualise by month).
  • Average view count of a question.
  • Number of questions with no answers.
  • Number of votes per day.

2. Do these key metrics need to be calculated, or are they readily available in a database?

  • Data will most likely be available at the lowest granularity (day-wise) in a database.
  • Large organisations will have data warehouses with data marts that aggregate this data by month or year.
  • Aggregation can be performed in the analytics layer to give a monthly or yearly view, depending on the data volume and the timeline of data needed.
  • For instance, if the requirement is to show five years’ worth of data plus the current year day-wise, the data volume will be huge. In that case it is better to load the aggregated five-year data from the data warehouse into analytics, load the day-wise raw data separately, and only perform calculations on the day-wise metrics (see the sketch after this list).
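
To make the two-level aggregation concrete, here is a minimal PySpark sketch. The table and column names (stackoverflow.questions, creation_date, is_answered, view_count) are placeholders I am assuming for illustration, not names from the actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kpi-aggregation").getOrCreate()

# Hypothetical day-wise fact table of questions.
questions = spark.table("stackoverflow.questions")

# Day-wise KPIs: questions per day, answered questions per day, average views.
daily = (
    questions
    .groupBy(F.to_date("creation_date").alias("day"))
    .agg(
        F.count("*").alias("questions"),
        F.sum(F.when(F.col("is_answered"), 1).otherwise(0)).alias("answered"),
        F.avg("view_count").alias("avg_view_count"),
    )
)

# Monthly roll-up for the longer time horizon.
monthly = (
    daily
    .groupBy(F.date_format("day", "yyyy-MM").alias("month"))
    .agg(F.sum("questions").alias("questions"),
         F.sum("answered").alias("answered"))
)
```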

3. Where does the data come from?

  • Historical data might be available in a database or data warehouse.
  • Real-time data can be ingested through StackAPI.
  • In this project, I will be using StackAPI to stream the data and Kinesis Data Generator to mock up some streaming data (a fetch sketch follows this list).
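
As a rough illustration, here is how a day’s worth of questions could be pulled with the stackapi Python library; the exact fields returned depend on the API filter used.

```python
from datetime import datetime, timedelta
from stackapi import StackAPI

SITE = StackAPI('stackoverflow')

# Pull the previous day's questions; fetch() handles paging internally.
now = datetime.utcnow()
response = SITE.fetch('questions', fromdate=now - timedelta(days=1), todate=now)

for question in response['items']:
    print(question['question_id'], question.get('title'))
```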

In an ideal situation, historical records should be loaded as a one-time activity, while daily questions should be stored in the data lake and synced into analytics.
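
One possible way to land each day’s fetched records in the data lake is to push them to the Kinesis Firehose delivery stream described later in the architecture. A minimal boto3 sketch, assuming a hypothetical stream name:

```python
import json
import boto3

firehose = boto3.client('firehose', region_name='us-east-1')

def put_questions(items):
    """Push fetched question records to a (hypothetical) delivery stream."""
    # Newline-delimited JSON keeps the S3 objects easy to parse later.
    records = [{'Data': (json.dumps(q) + '\n').encode('utf-8')} for q in items]
    # put_record_batch accepts at most 500 records per call.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName='stackoverflow-questions',
            Records=records[i:i + 500],
        )
```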

4. What is the format of this data?

  • Data streamed using StackAPI arrives in JSON format (a simplified record is shown below).
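
Here is a simplified example of what one question record looks like; the field values are illustrative, and real API responses wrap records in an "items" array alongside paging metadata such as "has_more" and "quota_remaining".

```python
import json

# creation_date is a Unix epoch timestamp in the API's responses.
raw = '''{
  "question_id": 63421783,
  "title": "How do I parse JSON in Python?",
  "creation_date": 1597449600,
  "is_answered": true,
  "answer_count": 2,
  "view_count": 57,
  "score": 1,
  "tags": ["python", "json"]
}'''

question = json.loads(raw)
print(question["title"], question["view_count"])
```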

5. Is there any additional data modelling required?

  • Data streamed through StackAPI already has the shape of a fact table, so no further modelling is required.

6. Do you need stream processing or batch processing?

  • For the user group identified, batch processing would suffice.
  • Jobs can be scheduled to run once a day, and the dashboard can be refreshed on the same cadence.

7. Where can you store the data?

  • Data can be stored in Amazon Redshift (a sketch of a possible table definition follows below).
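
As a sketch of what that storage could look like, here is a hypothetical Redshift table created with psycopg2 (Redshift speaks the Postgres wire protocol); the schema, keys, and connection details are all assumptions for illustration.

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS stackoverflow_questions (
            question_id   BIGINT,
            creation_date TIMESTAMP,
            is_answered   BOOLEAN,
            answer_count  INT,
            view_count    INT,
            score         INT
        )
        DISTSTYLE EVEN
        SORTKEY (creation_date);
    """)
```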

8. What would be the volume of the data?

  • StackAPI has a limit of 10,000 requests per day, so there is a cap on the number of records that can be streamed per day (see the throttling sketch after this list).
  • In order to have a complete view of the pipeline, I also used Kinesis Data Generator to mock up the data.
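
To stay inside that quota, the stackapi library exposes page_size and max_pages to cap how many requests a single fetch issues, and the response metadata reports how much quota is left. A small sketch:

```python
from stackapi import StackAPI

SITE = StackAPI('stackoverflow')
SITE.page_size = 100   # items per request (the API maximum)
SITE.max_pages = 10    # cap each fetch at 10 requests

response = SITE.fetch('questions')
print(response.get('quota_remaining'))  # requests left in today's quota
```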

9. What will be the visualisation tool?

  • Einstein Analytics will be used to visualise the data.

Technical Architecture

With these questions answered, we can now put together a technical architecture diagram for the data pipeline.

Technical architecture diagram

Brief Overview of the data pipeline:

  • Kinesis Firehose is chosen to stream the data from StackAPI and output it to an S3 bucket folder.
  • Spark will batch-process the streams from S3 on a daily basis and write the transformed data back to Redshift. This will be a script scheduled on EC2 once a day (a sketch follows this list).
  • Einstein Analytics will use its native S3 connector to sync the data and display it in the dashboards. Dashboards will be refreshed every day with the previous day’s data.
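
To give a feel for the middle step, here is a minimal sketch of the daily Spark batch job. Bucket names, paths, and connection details are hypothetical; in practice the script would be triggered by cron on the EC2 instance (e.g. 0 2 * * * spark-submit daily_job.py).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-stackoverflow-batch").getOrCreate()

# Firehose writes newline-delimited JSON under date-partitioned S3 prefixes.
raw = spark.read.json("s3a://stackoverflow-stream/2020/08/15/*")

transformed = (
    raw.withColumn("creation_date",
                   F.from_unixtime("creation_date").cast("timestamp"))
       .select("question_id", "creation_date", "is_answered",
               "answer_count", "view_count", "score")
)

# Append the day's transformed rows to Redshift over JDBC.
(transformed.write
    .format("jdbc")
    .option("url", "jdbc:redshift://my-cluster:5439/analytics")
    .option("dbtable", "stackoverflow_questions")
    .option("user", "admin")
    .option("password", "...")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())
```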

In the upcoming articles, I will be exploring each component of the pipeline in depth.

Here is the article describing how I streamed the data using Kinesis and stored it in S3 for further processing!
