How to build a technical design architecture for an analytics data pipeline
An overview of designing & building a technical architecture for an analytics problem.
This article is a continuation of the previous post and will outline how to transform our user requirements into a technical design and architecture.
Let’s summarise our two major requirements. We’ll target the analytics problem first!
The discovery phase is usually the hardest, because you have to engage multiple stakeholders and the tech team to build the right solution.
The only way to get through it is to ask questions!
Key questions to ask during the design phase
1. What are the key metrics (KPIs)? (A sketch of computing a few of them follows this list.)
- Number of questions per day (should be able to visualise by month)
- Number of answers per day (should be able to visualise by month)
- Number of accepted answers in a day (should be able to visualise by month as well)
- Number of unaccepted answers in a day (should be able to visualise by month as well)
- Average view count of a question
- Number of questions with no answers
- Number of votes in a day
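To make these concrete, here is a minimal sketch of computing a few of the metrics with pandas. The frame and its column names (`creation_date`, `view_count`, `answer_count`) are hypothetical, chosen to mirror the Stack Exchange question object.

```python
import pandas as pd

# Hypothetical raw question records; column names mirror the
# Stack Exchange API question object.
questions = pd.DataFrame({
    "question_id": [1, 2, 3],
    "creation_date": pd.to_datetime(
        ["2020-06-01 09:15", "2020-06-01 17:40", "2020-06-02 08:05"]),
    "view_count": [120, 45, 300],
    "answer_count": [2, 0, 1],
})

# Day-wise KPI rollup: question count, average views, unanswered count.
daily = questions.groupby(questions["creation_date"].dt.date).agg(
    questions_per_day=("question_id", "count"),
    avg_view_count=("view_count", "mean"),
    unanswered=("answer_count", lambda s: (s == 0).sum()),
)
print(daily)
```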
2. Do these key metrics need to be calculated, or are they readily available in a database?
- Data will most likely be available at the lowest granularity (day-wise) in a database.
- Large organisations will have data warehouses with data marts that aggregate this data by month or year.
- Aggregation can also be performed in the analytics tool to give a monthly or yearly view, depending on the data volume and the timeline of data needed.
- For instance, if the requirement is to show five years’ worth of data plus the current year day-wise, the data volume will be huge. In that case it is better to load the five years of data into analytics pre-aggregated from the data warehouse, load only the day-wise raw data for the current period, and perform calculations only on the day-wise metrics. A short illustration follows this list.
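As an illustration of that split, the sketch below (again with a hypothetical frame) collapses the older history to monthly grain while keeping the current period day-wise:

```python
import pandas as pd

# Hypothetical day-wise fact table spanning several years.
daily = pd.DataFrame({"date": pd.date_range("2015-01-01", "2020-06-30", freq="D")})
daily["question_count"] = 50  # stand-in metric values

cutoff = pd.Timestamp("2020-01-01")

# Older history: collapse to one row per month before loading to analytics.
monthly_history = (
    daily[daily["date"] < cutoff]
    .set_index("date")["question_count"]
    .resample("MS").sum()
    .reset_index()
)

# Current period: keep day-wise grain so daily metrics stay available.
current_daily = daily[daily["date"] >= cutoff]

print(len(monthly_history), "monthly rows vs", len(current_daily), "daily rows")
```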
3. Where does the data come from?
- Historical data might be available in a database or data warehouse.
- Real-time data can be ingested through StackAPI.
- In this project, I will be using StackAPI to stream the data and the Kinesis Data Generator to mock up some streaming data.
In an ideal situation, historical records should be loaded as a one-time activity, and daily questions should be stored in the data lake and synced by analytics.
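For reference, this is roughly how the stackapi Python library pulls question data; the date window and page settings here are illustrative assumptions.

```python
from datetime import datetime
from stackapi import StackAPI

# Pull recent Stack Overflow questions; StackAPI handles paging for us.
SITE = StackAPI('stackoverflow')
SITE.page_size = 100   # items per request
SITE.max_pages = 1     # keep the example within the daily request quota

questions = SITE.fetch(
    'questions',
    fromdate=datetime(2020, 6, 1),
    todate=datetime(2020, 6, 2),
)
print(len(questions['items']), 'questions fetched')
```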
4. What is the format of this data?
- Data streamed using StackAPI arrives in JSON format.
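A trimmed example of what one such record looks like once parsed; the fields are representative of the Stack Exchange question object, and the values are illustrative only.

```python
# Trimmed question record as streamed from the Stack Exchange API;
# values are illustrative only.
record = {
    "question_id": 101,
    "title": "Sample question title",
    "creation_date": 1591005600,   # unix epoch seconds
    "view_count": 120,
    "answer_count": 2,
    "score": 5,
    "is_answered": True,
    "tags": ["python", "json"],
}
```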
5. Is there any additional data modelling required?
- Data streamed through StackAPI already has the shape of a fact table, so no further modelling is required.
6. Do you need stream processing or batch processing?
- For the user group identified, batch processing will suffice.
- Jobs can be scheduled daily, and the dashboard can be refreshed on the same cadence (a date-window sketch follows this list).
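A minimal sketch of how such a daily job could derive yesterday’s processing window. The bucket name is a placeholder, and the key layout assumes Firehose’s default YYYY/MM/DD/HH prefix.

```python
from datetime import datetime, timedelta

# Yesterday's UTC date drives which S3 partition the batch job reads.
run_date = datetime.utcnow().date() - timedelta(days=1)

# Firehose's default key layout is YYYY/MM/DD/HH; read the whole day.
bucket = "my-stackoverflow-stream"   # hypothetical bucket name
prefix = run_date.strftime("%Y/%m/%d/")
s3_path = f"s3://{bucket}/{prefix}"
print("Processing", s3_path)
```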
7. Where can you store the data?
- Data can be stored in Amazon Redshift.
8. What would be the volume of the data?
- StackAPI has a limit of 10,000 requests per day, which caps the number of records that can be streamed daily.
- In order to exercise the full pipeline, I also used the Kinesis Data Generator to mock up data (see the sketch after this list).
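Mock records can also be pushed straight into the Firehose stream with boto3, which is essentially what the Kinesis Data Generator automates; the stream name and record shape below are assumptions.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Mock question event in the same shape as the real StackAPI records.
mock_record = {
    "question_id": 1,            # illustrative values only
    "creation_date": 1591005600,
    "view_count": 42,
    "answer_count": 0,
    "is_answered": False,
}

# Firehose buffers these and delivers them to the configured S3 bucket.
firehose.put_record(
    DeliveryStreamName="stackoverflow-questions",  # hypothetical stream name
    Record={"Data": (json.dumps(mock_record) + "\n").encode("utf-8")},
)
```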
9. What will be the visualisation tool?
- Einstein Analytics will be used to visualise the data.
Technical Architecture
With answers to all these questions, we can now conceive a technical architecture diagram for the data pipeline.
Brief overview of the data pipeline:
- Kinesis Firehose is chosen to stream the data from StackAPI and write it to an S3 bucket folder.
- Spark will batch-process the streams from S3 on a daily basis and write the transformed data to Redshift. This will be a script scheduled on EC2 once a day (a sketch follows this list).
- Einstein Analytics will use its native S3 connector to sync the data and display it in the dashboards. Dashboards will be refreshed every day with the previous day’s data.
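Putting it together, here is a hedged sketch of the daily Spark batch step: read the day’s JSON from S3, aggregate, and append to Redshift over JDBC. The bucket, table, and connection details are placeholders, and the Redshift JDBC driver would need to be on the Spark classpath.

```python
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stackoverflow-daily-batch").getOrCreate()

# Read yesterday's raw JSON records from the Firehose-populated bucket.
run_date = datetime.utcnow().date() - timedelta(days=1)
raw = spark.read.json(f"s3a://my-stackoverflow-stream/{run_date:%Y/%m/%d}/")

# Daily KPI aggregation over the question records.
daily = (
    raw.withColumn("day", F.to_date(F.from_unixtime("creation_date")))
       .groupBy("day")
       .agg(
           F.count("question_id").alias("questions"),
           F.avg("view_count").alias("avg_view_count"),
           F.sum(F.when(F.col("answer_count") == 0, 1).otherwise(0))
            .alias("unanswered"),
       )
)

# Append the day's aggregates to Redshift over JDBC.
(daily.write.format("jdbc")
      .option("url", "jdbc:redshift://example.redshift.amazonaws.com:5439/dev")
      .option("dbtable", "daily_question_metrics")
      .option("user", "analytics")   # placeholder credentials
      .option("password", "***")
      .mode("append")
      .save())
```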
In the upcoming articles, I will be exploring each component of the pipeline in depth.
Here is the article describing how I streamed the data using Kinesis and stored it in S3 for further processing!