
From CSVs to live dashboards: How we modernized our data access platform

Learn the six steps we took to upgrade data sharing from sending CSV files to using automated end-to-end data pipelines and live dashboards.

When our data and analytics team started modernizing our data access platform, we envisioned where we wanted to be a year later. We had a lot of work ahead of us, and we've come a long way since.

[ Learn how IT modernization can help alleviate technical debt. ]

As the Data and Analytics squad within IBM's CIO division, we use data and automation to transform our internal application portfolio and drive hybrid cloud platform adoption.

When we began this project, sharing data using CSV files was the norm. This article outlines the six steps we took to modernize from a world of CSVs to setting up automated end-to-end data pipelines and live dashboards, which the organization is using now.

Step 1. Collect stakeholder requirements

We began our modernization project by working closely with our stakeholders to understand their expectations.

One group of stakeholders is the IBM internal application teams that want to move their applications from legacy infrastructure to our hybrid cloud platform. We create modernity indices to drive hybrid cloud platform adoption and track progress. Modernity indices are metrics we define to track the usage of the key features and best practices we expect the CIO division to implement.

Our other stakeholder group is the hybrid cloud platform team, which is also part of the CIO division. They consume the data to get feedback on the hybrid cloud platform's performance and to identify areas for improvement.

[ Discover ways enterprise architects can map and implement modern IT strategy with a hybrid cloud strategy. ]

After many interviews and interactions with both groups of stakeholders, we came up with three categories and 13 modernity indices. The categories (with example metrics) are:

  • Monitoring and security
    • Monitoring and security metrics include the Four Golden Signals (latency, traffic, errors, and saturation) from the Google Site Reliability Engineering (SRE) team's best practices for monitoring.
  • Adoption
    • Adoption metrics include multi-zone and multi-region deployments for high availability and redundancy with load balancing.
  • Automation
    • Automation metrics include horizontal and vertical pod autoscaling for automated resource optimization.

In addition to the three categories and their metrics, we also created a scoring system to compare stakeholders' progress on hybrid cloud migration.
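
As a rough illustration of what such a scoring system can look like (the metric names, weights, and scoring rule below are invented for the example, not our actual 13 indices), a minimal sketch in Python might be:

    # Minimal sketch of a modernity-index scoring system.
    # Category weights, metric names, and the scoring rule are hypothetical.
    CATEGORY_WEIGHTS = {
        "monitoring_and_security": 0.4,
        "adoption": 0.3,
        "automation": 0.3,
    }

    def score_application(metrics: dict[str, dict[str, bool]]) -> float:
        """Return a 0-100 modernity score from per-category metric flags."""
        total = 0.0
        for category, weight in CATEGORY_WEIGHTS.items():
            flags = metrics.get(category, {})
            if not flags:
                continue
            total += weight * (sum(flags.values()) / len(flags))
        return round(total * 100, 1)

    # Example: an application that meets some, but not all, of the indices.
    app_metrics = {
        "monitoring_and_security": {"golden_signals": True, "alerting": True},
        "adoption": {"multi_zone": True, "multi_region": False},
        "automation": {"hpa_enabled": True, "vpa_enabled": False},
    }
    print(score_application(app_metrics))  # 70.0

A simple weighted score like this makes it straightforward to rank applications and track their movement over time.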

Step 2. Create ETL scripts and data pipelines

Once we collected all the requirements, we built the extract, transform, and load (ETL) scripts that target specific datasets to produce consumption-ready output. Our data sources provide various interfaces for data integration: APIs, SQL database queries, CSV files, and more. Our stakeholders also have different reporting mechanisms: dashboards, CSVs, APIs, and data warehouse access.

To accommodate these disparate requirements, we established a robust pipeline with the following capabilities (a simplified sketch follows the list):

  • Data-ingestion module with:
    • Reusable API module for multiple data sources
    • Database connections management, including DB2 and PostgreSQL
    • Database queries
    • CSV file reading
  • Data cleansing and normalization operations for:
    • Removal of stale or invalid records
    • Splitting into granular fields
    • Case normalization
  • Data transformations to support:
    • A reporting dashboard for the front-end user interface
    • CSV generation for different teams' needs
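
To make these capabilities concrete, here is a simplified sketch of the ingest-cleanse-transform flow in Python. The URL, field names, and cleansing rules are placeholders for illustration, not our actual sources or logic:

    # Simplified ETL sketch: ingest from an API or CSV, cleanse, transform.
    # The URL, field names, and rules below are placeholders.
    import csv

    import requests

    def ingest_api(url: str, token: str) -> list[dict]:
        # Reusable API ingestion shared by multiple data sources.
        resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def ingest_csv(path: str) -> list[dict]:
        with open(path, newline="") as fh:
            return list(csv.DictReader(fh))

    def cleanse(records: list[dict]) -> list[dict]:
        cleaned = []
        for rec in records:
            if not rec.get("app_id"):            # drop invalid records
                continue
            if rec.get("status") == "retired":   # drop stale records
                continue
            rec["owner"] = rec.get("owner", "").strip().lower()  # case normalization
            # split a composite field into granular fields
            region, _, zone = (rec.get("location") or "unknown/unknown").partition("/")
            rec["region"], rec["zone"] = region, zone
            cleaned.append(rec)
        return cleaned

    def transform(records: list[dict]) -> list[dict]:
        # Shape the cleansed data for the dashboard and CSV outputs.
        return [
            {"app_id": r["app_id"], "owner": r["owner"], "region": r["region"], "zone": r["zone"]}
            for r in records
        ]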

[ Become a Red Hat Certified Architect and boost your career. ]

Step 3. Package scripts into Python libraries

As our projects grew in size and complexity, we built reusable and robust libraries to handle data ingestion, cleansing, and transformation across all our pipelines. For reusability, packaged Python libraries let us encapsulate functionality in classes, hide the implementation details, and provide straightforward interfaces that our peer teams can use. For robustness, packaging also makes unit tests much easier to maintain across our code base, and we are on track to reach a high unit-test coverage rate in Q1 2023.
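
As a rough sketch of what that packaging looks like (the package, module, and class names here are invented for the example rather than taken from our actual library), the idea is a small installable package plus focused unit tests:

    # Hypothetical layout for a reusable data-ingestion package:
    #
    #   cio_data_tools/
    #       __init__.py
    #       ingest.py       # source classes like the one below
    #       cleanse.py
    #       transform.py
    #   tests/
    #       test_ingest.py
    #   pyproject.toml
    #
    # ingest.py -- hide the implementation behind a simple interface.
    import csv

    class CsvSource:
        """Reads records from a CSV file; callers never touch the csv module."""

        def __init__(self, path: str):
            self._path = path

        def fetch(self) -> list[dict]:
            with open(self._path, newline="") as fh:
                return list(csv.DictReader(fh))

    # tests/test_ingest.py -- encapsulation keeps unit tests small and focused.
    def test_csv_source(tmp_path):  # tmp_path is a pytest fixture
        sample = tmp_path / "apps.csv"
        sample.write_text("app_id,owner\n123,team-a\n")
        assert CsvSource(str(sample)).fetch() == [{"app_id": "123", "owner": "team-a"}]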

Step 4. Establish a data warehouse

We use a data warehouse to store our metrics persistently for reporting, analyzing adoption progress, forecasting migration with machine learning and artificial intelligence (ML/AI), and more. We decided that IBM Db2 Warehouse on Cloud was the most effective solution for our team because:

  • It is highly optimized for read-query performance
  • It is optimized for storing large amounts of data
  • It is low maintenance
  • It enables high availability and disaster recovery by default
  • It is hosted on IBM Cloud
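
A minimal loading sketch, assuming the ibm_db driver and a hypothetical MODERNITY_METRICS table (the DSN values, table name, and columns are placeholders):

    # Minimal sketch: load transformed metric rows into Db2 Warehouse on Cloud.
    # Connection details, table name, and columns are placeholders.
    import os

    import ibm_db

    dsn = (
        "DATABASE=BLUDB;"
        f"HOSTNAME={os.environ['DB2_HOST']};"
        "PORT=50001;PROTOCOL=TCPIP;SECURITY=SSL;"
        f"UID={os.environ['DB2_USER']};PWD={os.environ['DB2_PASSWORD']};"
    )
    conn = ibm_db.connect(dsn, "", "")

    rows = [{"app_id": "123", "metric": "multi_zone", "value": 1}]  # ETL output
    stmt = ibm_db.prepare(
        conn,
        "INSERT INTO MODERNITY_METRICS (APP_ID, METRIC, VALUE, COLLECTED_AT) "
        "VALUES (?, ?, ?, CURRENT TIMESTAMP)",
    )
    for row in rows:
        ibm_db.execute(stmt, (row["app_id"], row["metric"], row["value"]))

    ibm_db.close(conn)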

Step 5. Set up cron jobs

We started with ad-hoc reports as proofs of concept to ensure stakeholders agreed on the output. For each use case we defined, we went through a few iterations to refine the data output's quality and clarity.

Once we met the requirements and were happy with the outcomes, we needed a scalable way to produce our consumption-ready data on a regular schedule. Using our existing tools, we containerized our ETL applications, hosted them in our IBM hybrid cloud platform, and ran them on a cron schedule. The cron jobs produce a steady stream of data loaded into our Db2 Warehouse on Cloud for consumption using dashboards and other reporting approaches.
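
As an illustrative sketch (the entry point and schedule here are hypothetical, not our actual job definitions), each container image exposes a small entry point that the cron schedule invokes, for example daily at 06:00 with a schedule of "0 6 * * *":

    # etl_job.py -- illustrative entry point for a containerized ETL run,
    # triggered on a cron schedule (for example "0 6 * * *").
    import logging
    import sys

    def run_pipeline() -> int:
        """Placeholder for ingest -> cleanse -> transform -> load; returns row count."""
        return 0

    def main() -> int:
        logging.basicConfig(level=logging.INFO)
        try:
            count = run_pipeline()
        except Exception:
            logging.exception("ETL run failed")
            return 1
        logging.info("Loaded %d rows into the warehouse", count)
        return 0

    if __name__ == "__main__":
        sys.exit(main())

A non-zero exit code lets the scheduler flag failed runs so they can be retried or alerted on.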

[ Find out how to build a resilient IT culture. ]

Step 6. Report with dashboards

No data pipeline is complete without a proper consumption mechanism. Because we set up our data warehouse to be consumption-ready at the outset, it is straightforward to create dashboards and reports by reading from it. We chose the Apache Superset data exploration and visualization platform because it offers a complete self-service package for end users. It provides:

  • Granular permissions for different user groups
  • Custom SQL query capability
  • Dashboard accessibility through a public link
  • Report downloads

Where are we today?

Today our pipeline readily ingests, transforms, and loads consumable data into our warehouse. Our dashboards read from the warehouse to provide vital key performance indicators (KPIs), statistics, and reports to stakeholders across our organization. And Superset gives our stakeholders the flexibility to explore data and create dashboards for their purposes.

Image: IBM Hybrid Cloud (John Lee, Chloe Wang, CC BY-SA 4.0)

What's next?

As in many organizations, our data access is not yet fully streamlined. We are working with stakeholders to encourage API development and to use APIs wherever possible: APIs provide a data abstraction layer, offer security benefits, and streamline access methods.

We are also investigating how to structure future data sources, especially graph databases, for ML operations.

Takeaways

We learned a lot through this process. Our main takeaways are:

  • Take the time to architect the complete pipeline to identify any gaps in requirements and create checkpoints and timelines to completion.
  • Communicate closely and often with stakeholders to ensure the execution is on the right track.
  • Always think about how the current work will affect future projects to identify areas of opportunity for optimization.

This article originally appeared on Hybrid Cloud How-tos and is republished with permission.


John Lee

John has been a Data Engineer and Scientist at IBM for eight years. He enjoys all things technology and engineering, particularly automation and ML.


Chloe Wang

Chloe Wang leads the IBM Growth Engineering Team to drive IBM product growth for both SaaS and hybrid products.
