Projects and Papers


Projects



1. Machine Learning Model Deployment using Heroku and FastAPI

   In this project, a classification model is developed on publicly available Census Bureau data. A remote DVC (Data Version Control) store points to an AWS S3 bucket; DVC is used as the data versioning tool, keeps track of the data workflow, and can also serve as an experiment management tool. A Random Forest classifier is selected to classify the data samples. Additionally, K-fold cross-validation is used as a resampling technique to evaluate the model on a limited data sample.
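
   A minimal sketch of the training step, assuming scikit-learn and placeholder feature/label arrays (the actual project code and feature set may differ):

      # Sketch: Random Forest evaluated with K-fold cross-validation (scikit-learn).
      # X and y are stand-ins for the preprocessed census features and labels.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import KFold, cross_val_score

      X = np.random.rand(200, 10)           # placeholder for engineered census features
      y = np.random.randint(0, 2, 200)      # placeholder for the binary salary label

      model = RandomForestClassifier(n_estimators=100, random_state=42)
      cv = KFold(n_splits=5, shuffle=True, random_state=42)

      scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
      print(f"F1 per fold: {scores}, mean: {scores.mean():.3f}")
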
   To monitor the model performance on various data slices, several unit tests are created. The model API is served using the FastAPI package. Both the slice-validation and the API tests are incorporated into a CI/CD pipeline using GitHub Actions. Heroku is used for API deployment: a Heroku app is deployed from the GitHub repository, and automatic deployments are enabled only when continuous integration passes. Flake8 is used as a Python linter for style checking.
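
   A minimal sketch of how such an inference endpoint might look with FastAPI; the input schema, field names, and model path below are illustrative placeholders, not the project's actual interface:

      # Sketch: serving the trained classifier through a FastAPI endpoint.
      import joblib
      from fastapi import FastAPI
      from pydantic import BaseModel

      app = FastAPI()
      model = joblib.load("model/random_forest.joblib")  # hypothetical artifact path

      class CensusRecord(BaseModel):
          # Hypothetical subset of census features used as input.
          age: int
          education_num: int
          hours_per_week: int

      @app.post("/predict")
      def predict(record: CensusRecord):
          features = [[record.age, record.education_num, record.hours_per_week]]
          prediction = int(model.predict(features)[0])
          return {"prediction": prediction}

   Locally such an app can be started with uvicorn; for Heroku, the equivalent start command would go into the Procfile.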

2. Customer Segmentation Report for Arvato Financial Services

   This project was completed as the capstone project of the Machine Learning Engineer Nanodegree offered by Udacity, with data provided by Bertelsmann Arvato Analytics. Supervised and unsupervised learning techniques are used to analyze demographics data for the general population of Germany against demographics information of existing customers of a mail-order company (Arvato Financial Solutions). The main goal of this project is to identify the people who are most likely to become future customers.
   Principal component analysis (PCA) is used for dimensionality reduction. K-Means then segments the population and determines which of the resulting segments are most similar to the existing customers. Next, a supervised learning model is trained on historical responses to marketing campaigns; this model is used to predict which individuals are most likely to convert into customers of the company. Several ensemble methods are used to build the model, while grid search is used to tune the hyper-parameters. Model performance is assessed with the area under the ROC curve (ROC-AUC).
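
   A condensed sketch of the two stages, assuming scikit-learn and placeholder data; the component count, cluster count, choice of ensemble, and parameter grid are illustrative only:

      # Sketch: PCA + K-Means for segmentation, then a gradient-boosting model
      # tuned with grid search and evaluated by ROC-AUC. Data is a placeholder.
      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.model_selection import GridSearchCV

      population = np.random.rand(500, 50)    # stand-in for cleaned demographics data
      labels = np.random.randint(0, 2, 500)   # stand-in for campaign responses

      # Unsupervised part: reduce dimensionality, then segment the population.
      components = PCA(n_components=10).fit_transform(population)
      segments = KMeans(n_clusters=8, random_state=42).fit_predict(components)
      print("Segment sizes:", np.bincount(segments))

      # Supervised part: tune an ensemble model on historical campaign responses.
      grid = GridSearchCV(
          GradientBoostingClassifier(random_state=42),
          param_grid={"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]},
          scoring="roc_auc",
          cv=3,
      )
      grid.fit(components, labels)
      print("Best ROC-AUC:", grid.best_score_)
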

3. Conceptualization and Development of a Predictive Big Data Streaming Application in the Context of Automotive Manufacturing (Master Thesis Project)

   Data is increasingly affecting the manufacturing industry. Every second, Internet of Things devices generate large amounts of sensor data. It therefore becomes necessary to process data not only quickly and efficiently, but also instantly. The digitalization of industrial production enables the use of data-driven techniques to extract insights and knowledge from large data volumes. Knowledge discovery in the context of automotive manufacturing improves decision-making, enables intelligent services, and detects manufacturing anomalies as they occur. However, dealing with large volumes of data at high velocity is challenging: collecting sensor data, configuring a set of anomaly-detection rules, and processing and analyzing data in real time is costly and hard to maintain.
   In this thesis, a well-defined, efficient, and scalable technique for real-time predictive maintenance is implemented using the concepts of the lambda architecture. On this basis, an architecture that enables both batch and stream processing at low latency is conceptualized. To prove the functionality of such a platform, a pilot anomaly-detection application is developed. It covers an end-to-end framework, including data sourcing, processing, transformation, model training, prediction, and visualization. On the batch layer, high volumes of complex historical datasets are analyzed; anomaly-detection rules (i.e. learned models) are extracted and constantly updated to improve the analysis accuracy. On the speed layer, data is ingested and analyzed instantly during the manufacturing process, and a streaming engine running on the production network enables real-time prediction.
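
   The thesis implementation relies on a full streaming stack; the following is only a conceptual Python sketch of the lambda-architecture split, with a simulated sensor stream and a simple threshold rule standing in for the learned anomaly models:

      # Conceptual sketch of the lambda-architecture split: the batch layer
      # periodically recomputes an anomaly rule from historical data, while the
      # speed layer applies the current rule to each incoming sensor reading.
      import random
      import statistics

      history = [random.gauss(50.0, 5.0) for _ in range(1000)]  # simulated historical sensor data

      def batch_layer(data):
          """Batch layer: derive an anomaly rule (here: mean +/- 3 std) from history."""
          mean, std = statistics.mean(data), statistics.stdev(data)
          return (mean - 3 * std, mean + 3 * std)

      def speed_layer(reading, rule):
          """Speed layer: score a single reading against the current rule in real time."""
          low, high = rule
          return not (low <= reading <= high)

      rule = batch_layer(history)
      stream = [random.gauss(50.0, 5.0) for _ in range(20)] + [95.0]  # simulated live stream
      for reading in stream:
          if speed_layer(reading, rule):
              print(f"Anomaly detected: {reading:.1f} outside {rule}")
          history.append(reading)  # readings feed back into the next batch recomputation
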

4. Modelling and Prediction for Movies

   This project was presented at the Fraunhofer Institute as part of the RWTH Lab. Several techniques, such as multivariate regression and feature modelling, are applied to predict movie ratings. The application is built in the R environment for statistical computing and graphics. The research questions addressed in this analysis are:
• Is there any correlation among general features that makes the audience like or dislike a movie?
• Analyzing a set of historical data, which attributes have most affected movies' popularity in the past?
• Can we predict the IMDB rating of a movie based on its genre, MPAA rating, runtime, or director? Which variables have the highest impact on the final prediction?
   Predicting a movie's popularity ahead of its release is a valuable metric for many critics. Companies like Netflix perform this kind of analysis to assess the potential success of upcoming movies.
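
   The original analysis was done in R; the following Python sketch only illustrates the kind of multivariate regression used, with made-up rows and feature names standing in for the real movie dataset:

      # Sketch: a multivariate regression predicting an IMDB-style rating from
      # genre, MPAA rating, and runtime. The data below is purely illustrative.
      import pandas as pd
      from sklearn.compose import ColumnTransformer
      from sklearn.linear_model import LinearRegression
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import OneHotEncoder

      movies = pd.DataFrame({
          "genre": ["Drama", "Comedy", "Action", "Drama"],
          "mpaa_rating": ["R", "PG-13", "PG-13", "R"],
          "runtime": [120, 95, 130, 110],
          "imdb_rating": [7.8, 6.1, 6.9, 7.2],
      })

      preprocess = ColumnTransformer([
          ("categorical", OneHotEncoder(handle_unknown="ignore"), ["genre", "mpaa_rating"]),
      ], remainder="passthrough")  # runtime passes through unchanged

      model = Pipeline([("prep", preprocess), ("regression", LinearRegression())])
      model.fit(movies[["genre", "mpaa_rating", "runtime"]], movies["imdb_rating"])
      print(model.predict(pd.DataFrame([{"genre": "Drama", "mpaa_rating": "R", "runtime": 115}])))
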


Papers



1. SPARQL Query Optimization for Federated Linked Data

   The Web has evolved from a system of internet servers serving formatted documents into a web of linked data. In recent years, the Web of Data has been growing constantly, and a large collection of interlinked datasets from multiple domains has emerged. To exploit the diversity of all available data, federated queries are needed. However, problems such as limited processing power, long query response times, high workload, and outdated information hinder query processing. In this paper, I aim to explain various optimization techniques that have the potential to significantly improve the final query runtime. I start by briefly introducing recent federation approaches and show why SPARQL endpoint federation is my main focus. Specifically, I compare state-of-the-art SPARQL query federation engines and analyze their respective optimization approaches. The main federation engines I analyze in terms of query optimization are FedX, DARQ, and SPLENDID. As a result, I provide concrete examples and conclude which of the engines has the best performance, with query execution time as the key criterion.
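
   To illustrate what a federated query looks like, here is a sketch of a SPARQL 1.1 query that joins DBpedia data with a remote endpoint via the SERVICE keyword, sent with the SPARQLWrapper Python library; the endpoints and triple patterns are illustrative and not taken from the paper:

      # Sketch: a federated SPARQL query combining DBpedia and Wikidata.
      # Where and how the query is split across endpoints is precisely what
      # federation engines such as FedX, DARQ, and SPLENDID try to optimize.
      from SPARQLWrapper import SPARQLWrapper, JSON

      query = """
      PREFIX dbo: <http://dbpedia.org/ontology/>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      SELECT ?city ?population WHERE {
          ?city a dbo:City ;
                owl:sameAs ?wikidataCity .
          FILTER(STRSTARTS(STR(?wikidataCity), "http://www.wikidata.org/entity/"))
          SERVICE <https://query.wikidata.org/sparql> {
              ?wikidataCity wdt:P1082 ?population .
          }
      } LIMIT 10
      """

      endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
      endpoint.setQuery(query)
      endpoint.setReturnFormat(JSON)
      results = endpoint.query().convert()
      for row in results["results"]["bindings"]:
          print(row["city"]["value"], row["population"]["value"])
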

2. An Overview of Visualization in Mathematics, Programming and Big Data

   Visualization is a descriptive way to capture the audience's attention and to help people better understand the content of a given topic. Nowadays, in the world of science and technology, visualization has become a necessity. However, it is a huge challenge to visualize varying amounts of data in a static or dynamic form. In this paper we describe the role, value, and importance of visualization in maths and science. In particular, we explain in detail the benefits and shortcomings of visualization in three main domains: Mathematics, Programming, and Big Data. Moreover, we show the future challenges of visualization and our perspective on how to better approach and address current problems through technical solutions.