Before I became a data scientist, the way data appeared on a web page seemed like magic (to be honest, some parts of the process still feel that way!). Having been involved in many aspects of the data pipeline at Freya Systems, I wanted to share an overview of how it works.
The goal of the data pipeline is to provide insight into raw data and visualize it in a convenient way. Typically this involves a database that stores the data, queries and/or algorithms that aggregate and analyze it, and a web application that displays the results and updates whenever new data arrives.
A data pipeline is a sequence of steps that move and process data in order to automate analysis, storage, and visualization. Depending on how simple or advanced it needs to be, it may include both manual and automated steps so that resources are used efficiently. Not every project follows exactly the same steps, but the following ones are common across many of our projects.
Data Pipeline Steps
Raw Data
Starting with the raw data source, the first step is to transform the data into a format suited for database storage and load it into a database such as PostgreSQL, Teradata, MySQL, or MS SQL Server. If the raw data arrives as a simple CSV, the database can ingest it largely as is; otherwise it must be transformed by a custom file parser, which could be a simple Python/R script or a more complex Java program. Domain expertise is highly beneficial here so that columns are named properly, values are stored with specific data types, null values are interpreted correctly, and, most importantly, we understand how the file sources and database tables relate to each other.
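To make this concrete, here is a minimal sketch of such a parser in Python with pandas; the file name, column names, connection string, and staging table are all hypothetical placeholders, not part of any particular project.

```python
# Minimal sketch of a file parser that loads raw data into PostgreSQL.
# The file name, column names, and connection string are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/pipeline_db")

# Read the raw file, naming columns and declaring types explicitly so the
# database receives clean, consistently typed data.
raw = pd.read_csv(
    "raw_events.csv",
    names=["event_id", "recorded_at", "sensor_value"],
    header=0,
    dtype={"event_id": "int64", "sensor_value": "float64"},
    parse_dates=["recorded_at"],
    na_values=["", "NA", "-999"],  # interpret known null markers correctly
)

# Append the cleaned rows to a staging table in the database.
raw.to_sql("staging_events", engine, if_exists="append", index=False)
```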
Tools like Apache NiFi help here: when new data arrives, the ETL automatically processes it and stores it in the database, either upon arrival or on a timed schedule.
Queries and Algorithm
Once the data is in a database, we are free to perform exploratory data analysis to understand what information needs to be communicated and, if desired, to develop prediction algorithms. The heart of a data scientist’s work is here, within the data, making sense of the information and identifying potential insights to be gained.
We typically run SQL queries on the data from an R or Python script, which lets us connect to the database automatically using built-in libraries. The query results are stored as various data types, most commonly as a data frame, which closely resembles a SQL table or CSV file. The script has one section that accesses this data and another that performs any transformations not easily done in SQL. These transformations shape the data into a form that is well suited for machine learning models.
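As an illustration, a minimal sketch of the query-and-transform sections might look like the following, assuming the hypothetical staging table and connection details from the earlier example.

```python
# Minimal sketch of the query-and-transform sections of an analysis script,
# assuming the hypothetical staging_events table and connection string above.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/pipeline_db")

# Pull the rows needed for the analysis into a data frame.
df = pd.read_sql(
    "SELECT event_id, recorded_at, sensor_value FROM staging_events",
    engine,
)

# Transformations that are awkward in SQL are done here instead, e.g.
# deriving a feature from the timestamp and smoothing the raw measurement.
df["hour_of_day"] = df["recorded_at"].dt.hour
df["sensor_value_smoothed"] = (
    df.sort_values("recorded_at")["sensor_value"]
    .rolling(window=5, min_periods=1)
    .mean()
)
```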
In another section, the model is applied to the data, producing a prediction value for each row of the data frame. The data should have one row per event or item, with one or more identification columns and one or more feature columns that serve as input to the model; running the features through the model yields one or more output columns containing the predictions.
The final section of the R or Python script saves the data frame results to a CSV, which allows the ETL to pass the results to the next step of the data pipeline. The ETL controls when the algorithm runs so that it can also manage the output, but during development this step can be run manually.
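A minimal sketch of these modeling and output sections, assuming a previously trained scikit-learn model saved to disk and the hypothetical feature columns from the earlier example, might look like this.

```python
# Minimal sketch of the modeling and output sections. The model file,
# column names, and CSV file names are hypothetical.
import joblib
import pandas as pd

# Load the transformed data (or reuse the data frame built in the
# previous section).
df = pd.read_csv("transformed_events.csv")

id_cols = ["event_id"]                                   # identification columns
feature_cols = ["hour_of_day", "sensor_value_smoothed"]  # model inputs

model = joblib.load("event_model.joblib")  # hypothetical trained model

# One prediction per row: identification columns pass through untouched,
# feature columns feed the model, and the output column holds the prediction.
df["predicted_label"] = model.predict(df[feature_cols])

# Save only what the next pipeline step needs; the ETL picks this file up.
df[id_cols + feature_cols + ["predicted_label"]].to_csv(
    "predictions.csv", index=False
)
```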
Transforming the Output
The next step of the data pipeline involves preparing the data to be displayed on a web application. The ETL loads the CSV files into the necessary database and makes sure that all data needed for the web application is in the right tables, possibly by pulling from multiple database sources. This is also where any materialized views containing complex business logic get refreshed so they are ready before the web application needs them. We typically don’t work with live streaming data in real time, so this process updates as needed, on a cadence that varies by application, e.g. daily or monthly.
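As a sketch, assuming a PostgreSQL target with a hypothetical predictions table and materialized view, this step might look like the following.

```python
# Minimal sketch of the output-transformation step, assuming a hypothetical
# predictions table and materialized view in the web application's database.
import psycopg2

conn = psycopg2.connect(
    "dbname=webapp_db user=etl password=secret host=localhost"
)
with conn, conn.cursor() as cur:
    # Load the algorithm output produced in the previous step.
    with open("predictions.csv") as f:
        cur.copy_expert(
            "COPY predictions (event_id, hour_of_day, sensor_value_smoothed, "
            "predicted_label) FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    # Refresh the materialized view that encapsulates the complex business
    # logic so it is ready before the web application queries it.
    cur.execute("REFRESH MATERIALIZED VIEW daily_prediction_summary")
conn.close()
```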
Web Server and Visualization
The final step of the data pipeline is using a web server to share the data on a web application. Software developers build the architecture for the web application, and data scientists help design the details of any tables or visuals that should be displayed. A simplified description of a web application is a backend web server that accepts HTTP requests from the frontend, connects to the database, and returns responses that the frontend displays as HTML. Many different tools and frameworks can be used to build web applications, for example a Java and Spring Boot backend with an Angular or React frontend.
The backend connects to the database and queries the data that the frontend needs for its display elements. These queries are simpler than the ones in the algorithm section that handled all of the business logic; most of the data being queried here should already be in a format that needs only simple filters or quick aggregations. The backend passes the results to the frontend as JSON, which is converted into HTML elements and displayed throughout the web application in tables or graph visuals that best communicate the insight derived from the data science work.
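For brevity, here is a minimal sketch of such a backend endpoint in Python with Flask rather than the Java/Spring Boot stack mentioned above; the table, view, and column names are the hypothetical ones from the earlier examples.

```python
# Minimal sketch of a backend endpoint that serves prediction summaries as
# JSON. Shown in Flask for brevity; the view and column names are hypothetical.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)


@app.route("/api/predictions/daily")
def daily_predictions():
    # A simple filter/aggregation query; the heavy business logic already
    # lives in the materialized view refreshed by the ETL.
    conn = psycopg2.connect(
        "dbname=webapp_db user=web password=secret host=localhost"
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT day, avg_predicted_label "
            "FROM daily_prediction_summary ORDER BY day"
        )
        rows = cur.fetchall()
    conn.close()
    # Return JSON that the frontend converts into tables or charts.
    return jsonify(
        [{"day": str(day), "avg_prediction": float(avg)} for day, avg in rows]
    )


if __name__ == "__main__":
    app.run(port=5000)
```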
Automated Solutions
When new data arrives, the ETL automatically takes it through these pipeline steps so that the web application is updated with the latest information.
This overview provides a framework for anyone wanting to learn more about how data can automatically be analyzed and displayed in a web application. This can also help anyone getting started with a data analytics project think about the different resources that will be used throughout development.
If you are interested in visualizing advanced insights on a web application, contact Freya Systems to learn more.