An Automated Feature Engineering Web App Using Flask, MongoDB, and Docker, Made pip-Installable to Run from the Command Line
This article explains what this automated feature engineering web app is and how to make it Python installable, i.e. how to let users pip install the package and launch the containerized web app on the host machine from the command line.
The web app for automated feature engineering still uses the Flask development server, and all comments are welcome. Please find the code here.
I recently wrapped up a terrific journey: in 2 weeks, I went from a data scientist with little knowledge of web app development to building a simple web application that can run anywhere Docker is installed. All the user needs to do is pip install the package and run one command to launch the web app. This article might suit you if you are a “Pythoner” and don’t know (1) how to get a Flask app and MongoDB up and running in Docker containers, (2) how to ship those containers as a Python package, and (3) how to let users run it, after installing the package, with command-line commands and arguments. If you are new to Flask (like me 2 weeks ago, when I only knew how to use it to create API endpoints for a machine learning model), I highly recommend the book Flask By Example: after following its 3 examples, you will feel much more confident in Flask development.
Introduction to the web app
The purpose of this web app is to free data scientists from time-consuming coding when doing the crucial but tedious feature engineering on relational datasets, especially when it involves hundreds of variables spread across dozens of tables. The open-source Python package featuretools does the heavy lifting. To put it simply, if you have worked with Pandas or Spark DataFrames, you have probably written plenty of aggregation operations such as “count, mode, sum, std…” in “groupby…agg”, or transformation operations, such as a square root or a custom function applied to every row of a column, on a single dataset or on relational datasets. Featuretools simply manages those operations (called primitives in its context) for you by stacking them deeply and synthetically; this process is called Deep Feature Synthesis (DFS), the core concept under the hood.
To run DFS, featuretools needs to know which datasets are used (Entities) and how pairs of them are related (many to one); together these form an EntitySet. You also have the flexibility to specify which primitives you intend to use, or even to define your own primitives using its APIs. There is more to Featuretools and it takes time to sink in, which is why I started this web app: to take the burden off your shoulders.
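To make the idea of "stacking primitives" concrete, here is a toy sketch in plain Python. It is not how featuretools is implemented; the table, the column names, and the feature-name format are all made up for illustration. It applies one transformation primitive (square root) to every row and then several aggregation primitives per customer, which is roughly what a depth-2 feature looks like.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical child table: one row per transaction, many rows per customer.
transactions = [
    {"customer_id": 1, "amount": 4.0},
    {"customer_id": 1, "amount": 16.0},
    {"customer_id": 2, "amount": 9.0},
]


def sqrt_primitive(value):
    """Transformation primitive: applied to every single row of a column."""
    return math.sqrt(value)


# Aggregation primitives: collapse many child rows into one value per parent.
AGG_PRIMITIVES = {"SUM": sum, "MEAN": mean, "COUNT": len}


def stack_primitives(rows, key, column):
    """Transform each row first, then aggregate per key (two stacked steps)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(sqrt_primitive(row[column]))
    return {
        parent: {
            f"{name}(SQRT({column}))": agg(values)
            for name, agg in AGG_PRIMITIVES.items()
        }
        for parent, values in groups.items()
    }


features = stack_primitives(transactions, "customer_id", "amount")
print(features[1])
```

With dozens of tables and hundreds of columns, writing every such combination by hand is exactly the tedious part that DFS automates.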
Without further ado, let’s get into the app.
The logic of the application is to collect, through a user-friendly interface, the parameters that featuretools requires to generate features, and to calculate the feature schema for you (using only a sample of the data) through auto_feature. With the config file (which contains the parameters) and the schema file, you can then easily run the package auto_feature_prod on the whole datasets somewhere with adequate computation capacity (I will also write an API for you to integrate with your scikit-learn Pipeline in the coming weeks).
STEP 1: Define the config and schema names you like and provide the S3 path for your data files (only S3 buckets are supported right now).
STEP 2: You will then land on this navigation page. In the content, the bold black words are what you defined on the previous page and the red ones are navigation links. As mentioned below, you have to follow the sequence in the navigation panel: Entities and Relations, which I explained earlier; Pivotings, which decides which single column, or concatenation of two columns, you want pivoted (for example, to dummify a variable); and Primitives, which suggests which primitives to use on which columns.
For Entities, Relations, and Pivotings, I used WTForms to build dynamic selections and MongoDB to store the results. The pages are quite interactive: you can view the results as soon as you click the “Submit” button, and you can also “Delete” each submission.
After Entities, Relations, and Pivotings, the Primitives table is created, as shown in the pic below. I intend it to give you some basic recommended aggregation (agg_prm) and transformation (tfm_prm) primitives for each feature, plus some backup ones (agg_prm_backup and tfm_prm_backup) for your reference. The table is editable, which means you can click the cells in agg_prm and tfm_prm and modify the values: for example, if you feel that the recommended aggregation primitives for your claimant number and Name features are not appropriate, just click the corresponding cell and delete them. Those you remove are automatically added to the backup. Here I leveraged jQuery and Ajax to achieve this dynamic change.
STEP 3: Go back to Home (the navigation page shown in STEP 2) and ‘Save’ the config file and ‘Preview’ the features. When you click Preview, featuretools’ DFS comes into play. It has two options: one only derives the schema, while the other actually calculates the real values. Obviously, the former saves a lot of time, so I use it to generate the schema (all the calculated feature names) and provide this webpage for you to adjust the schema before letting DFS do the calculation. There are two tables on this page: the lower one is the “Results” schema, and you can “Delete” any feature you don’t find useful; deleted features are listed in the upper “Removed” table in case you regret it.
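The difference between "schema only" and "real values" in STEP 3 can be pictured with a small toy sketch. In featuretools itself, the schema-only option corresponds to passing features_only=True to ft.dfs; the code below does not call featuretools at all. The function names, the entity, and the columns are hypothetical, and the feature names are only loosely modeled on the featuretools naming style.

```python
from itertools import product


def derive_schema(entity, columns, agg_primitives):
    """Return feature names only; no data is read, so this is cheap."""
    return [
        f"{prim.upper()}({entity}.{col})"
        for prim, col in product(agg_primitives, columns)
    ]


def drop_features(schema, removed):
    """Mimic the 'Delete' button: split the schema into kept and removed names."""
    kept = [name for name in schema if name not in removed]
    gone = [name for name in schema if name in removed]
    return kept, gone


schema = derive_schema("transactions", ["amount", "quantity"], ["sum", "mean"])
results, removed = drop_features(schema, {"MEAN(transactions.quantity)"})
print(results)   # the lower "Results" table
print(removed)   # the upper "Removed" table
```

Only the names left in the results list would then be handed to the expensive calculation on the full datasets.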
STEP 4: Now you are good to “Save” the schema on the navigation page.
STEP 5: Launch the Prod version or integrate the code in a scikit-learn Pipeline (will come later).
I am not planning to go into much detail this time about the Flask or frontend code that renders this web application, but if they interest you, I would really like to share in the future some of the topics that gave me a hard time: why I am using MongoDB, how to make a dynamic selection via WTForms, how to make the values in one table cell change automatically in real time to reflect a change the user made in another cell (the jQuery and Ajax behind it), etc.
Next, I will explain how to containerize the app and package it up.
Dockerize the web app and make the whole thing Python installable
I will take the package auto_feature_prod as an example. Just a reminder: the auto_feature package takes users’ inputs and generates the config and schema files, as explained above. A sample of data is enough for that process. With those two files and the whole datasets, you can use auto_feature_prod to produce the final results. Briefly, this package is another web app that asks you to specify the real datasets corresponding to those sample datasets.
To launch the web app, you simply pip install the package and run auto_run_prod, entering the S3 bucket name and prefix at the command-line prompt; the Docker containers are then started and the web app is launched on your local machine. Users don’t need to have MongoDB or other packages pre-installed to support the web app, because all the dependencies are handled by the package and the Docker containers.
It’s still running on the Flask server for testing and debugging:
So, what happens behind the scenes?
The chain of actions is as follows (I’ve attached my file structure for your reference). In setup.py, we define “console_scripts” (the command auto_run_prod in this example) to invoke cli.py. In cli.py, we use the @click decorators to accept command-line arguments, save them to a file, and then run run_docker.sh. The script run_docker.sh kills the old running Docker processes and, based on docker-compose.yml, sets up two Docker images: one for the Flask app, assembled by the commands in the Dockerfile, and the other for MongoDB, whose image is already named in docker-compose.yml. Besides setting up those two images, docker-compose.yml also runs the command that launches the main file views.py, which is responsible for launching the web app.
I will briefly explain the whole process:
STEP 1: In setup.py, I used setuptools to create the Python package. The only thing I’d like to emphasize is “entry_points”: here I use “console_scripts”, and the script in the pic below indicates that when you run auto_run_prod from the command line, the function main in the cli.py file from the auto_feature_prod folder will be executed.
STEP 2: For the main function in cli.py, the key part I would like to emphasize is the line of code highlighted in the pic, which changes the directory; otherwise it couldn't find the run_docker.sh file. In the previous step, starting from the directory of setup.py, the entry point could find cli.py in the auto_feature_prod folder easily through auto_feature_prod.cli. However, to run the bash script with bash run_docker.sh, os.system uses your current working directory, where you don’t have run_docker.sh. That highlighted line of code comes to the rescue: it changes the directory to where run_docker.sh exists, <the location of the package>/<auto_feature_prod folder>.
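As a rough sketch of what such a setup.py can look like: the command name, package name, and main function below come from this article, while the version number, the click dependency entry, and include_package_data are my assumptions for illustration. The arguments are kept in a dictionary only so they can be inspected without triggering a build.

```python
# setup.py (sketch; values marked as assumptions are placeholders)
SETUP_KWARGS = {
    "name": "auto_feature_prod",
    "version": "0.0.1",               # assumption: placeholder version
    "packages": ["auto_feature_prod"],
    "include_package_data": True,     # assumption: ship run_docker.sh etc.
    "install_requires": ["click"],    # assumption: cli.py relies on click
    "entry_points": {
        "console_scripts": [
            # <command name> = <package>.<module>:<function>
            "auto_run_prod = auto_feature_prod.cli:main",
        ],
    },
}


def run_setup():
    """Hand the arguments to setuptools; a real setup.py calls this at import."""
    # Imported lazily so the dictionary can be read without setuptools present.
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

After pip install, the installer generates a small auto_run_prod executable that imports auto_feature_prod.cli and calls main, which is why the command works from any directory.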
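The directory change described in STEP 2 can be sketched as follows. This is a minimal illustration rather than the actual cli.py: the helper names are hypothetical, the click option handling is left out, and the runner parameter exists only so the logic can be exercised without really starting Docker.

```python
import os


def package_dir(module_file):
    """Directory that contains the given module file (e.g. cli.py)."""
    return os.path.dirname(os.path.abspath(module_file))


def run_bundled_script(module_file, script="run_docker.sh", runner=os.system):
    """Switch to the package folder, then run the bash script that lives there."""
    # Without this chdir, bash would look for the script in whatever
    # directory the user happened to run the command from.
    os.chdir(package_dir(module_file))
    return runner(f"bash {script}")


# Inside cli.py's main() this would be called as:
#     run_bundled_script(__file__)
```

Resolving the folder from the module's own __file__ means the command works no matter where pip placed the package.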
STEP 3: The main job of run_docker.sh is to kill the older Docker processes, build the services in docker-compose.yml, and bring them up. I chose docker-compose because I actually have two Docker containers, one for the Flask app and the other for MongoDB, and Compose is a tool for defining and running multi-container Docker applications.
Let’s take a look at docker-compose.yml. The template I chose is quite simple and easy to understand. There are two services (Docker images) that I will build: web (for the Flask app) and db (for MongoDB). Normally, we don’t need to write the whole image ourselves; instead we use a base image (you can find many more on Docker Hub) and then install what we need on top of it.
For MongoDB, I picked the base image mongo:3.6.1; no further steps seem to be needed, and the base image is enough for me.
Regarding web, I wrote a Dockerfile that describes how to build the image. The build key in docker-compose.yml finds the Dockerfile in the specified directory: here the dot represents the directory where docker-compose.yml is located.
The Dockerfile is also quite easy to understand: FROM lets us start the build from a base image (python:3.7 in this case) from Docker Hub; ADD . /app copies everything in the current directory (our server code and configurations) into the project directory; WORKDIR app/ sets the path app/ (where we copied everything to) in the container as the working directory; RUN executes a command to install all the dependencies listed in requirement.txt.
After the image is built from the Dockerfile, the command is run, with port 8000 in the Docker container mapped to port 8000 on the local host and a link to the MongoDB container (db). In the mounted volumes part, ~/.aws/:/root/.aws/:ro gives the container read-only access to the .aws directory so that it can read your AWS SDK credentials. This is quite important if you would like the web app to access your S3 buckets. The path ~/.aws before the : is where your AWS credential file is located on your local machine (here is the tutorial if you happen not to know what that is), and /root/.aws is where those credential files are mounted in your container.
STEP 4: When the containers are built and running, app.run in views.py is executed. I set the port to 8000 (matching the one in docker-compose.yml) and host = ‘0.0.0.0’ so that the server listens on all interfaces; otherwise connections from outside the container would be refused.
Now we are good to go!
This was really a splendid journey for me! Many of the challenges I faced were not directly answered by googling but were solved by assembling answers on different topics. I know for sure that my solution might not be the best, so please feel free to leave any comments! Thanks.