Consolidating Unstructured Knowledge into Structured Documentation

Context

Over the last few years, the usage of large language models (LLMs), Retrieval-Augmented Generation (RAG), and Agentic AI has increased, as has the quality of the generated outputs. RAG enables LLMs to leverage (internal) company information to answer more complex and detailed questions. However, this information is often not well structured, centralized, or of high quality (e.g., a significant amount of knowledge resides in emails, where the information is distributed across multiple conversations, folders, and mailboxes). As a result, the quality of the answers produced by RAG systems strongly depends on the quality of the available information.

Motivation

In a company, a substantial amount of implicit knowledge is stored in personal or organizational notebooks. This information is a valuable asset for the organization and would benefit day-to-day operations, but it is difficult to access unless it is properly processed. This project tackles this challenge.

Goal

Develop a proof of concept (PoC) that integrates diverse unstructured text data (stretch goal: other modalities such as images and tables) from multiple sources, consolidates the information into a single structured information source, and presents it as documentation based on Markdown files.

The overall goal of this project is to assess whether information from multiple sources can be merged with satisfactory quality, so that the generated documentation can be used as a reference for downstream tasks such as a chatbot or as a knowledge resource for employees. For the PoC, different notebooks from employees with SAP expertise are made available (most of them are in German; the choice of language can be discussed).

The project roughly follows these steps:

  1. Extract the information from the source files
  2. Design and implement a workflow to consolidate the information into a single source
    1. Topic modelling: Identify topics present in the provided information
    2. Summarization: Consolidate information across multiple sources by topic
  3. Transform the consolidated information into Markdown documentation
  4. Evaluate the system’s performance and explore different approaches for improvement
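The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the intended implementation: the `Note` class, the `TOPIC_KEYWORDS` table, and all function names are hypothetical, the keyword lookup is a simple stand-in for real topic modelling, and dropping exact duplicates is a stand-in for LLM-based summarization.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Note:
    source: str  # hypothetical identifier of the notebook the text came from
    text: str    # text already extracted from the source file (step 1)


# Hypothetical keyword lists standing in for a real topic model (step 2.1).
TOPIC_KEYWORDS = {
    "Transport Management": ["transport", "stms"],
    "User Administration": ["user", "role", "su01"],
}


def assign_topic(text: str) -> str:
    """Return the first topic whose keywords occur in the text."""
    lowered = text.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return topic
    return "Uncategorized"


def consolidate(notes: list[Note]) -> dict[str, list[Note]]:
    """Group notes by topic and drop exact duplicates (stand-in for step 2.2)."""
    grouped: dict[str, list[Note]] = defaultdict(list)
    seen: set[tuple[str, str]] = set()
    for note in notes:
        topic = assign_topic(note.text)
        key = (topic, note.text.strip().lower())
        if key in seen:
            continue  # same statement already recorded for this topic
        seen.add(key)
        grouped[topic].append(note)
    return dict(grouped)


def render_markdown(grouped: dict[str, list[Note]]) -> str:
    """Render one Markdown section per topic, keeping source attribution (step 3)."""
    lines = ["# Consolidated Documentation", ""]
    for topic in sorted(grouped):
        lines.append(f"## {topic}")
        lines.append("")
        for note in grouped[topic]:
            lines.append(f"- {note.text.strip()} (source: {note.source})")
        lines.append("")
    return "\n".join(lines)
```

Calling `render_markdown(consolidate(notes))` on a list of `Note` objects yields a single Markdown string with one section per topic. Keeping the source of each statement in the output is a deliberate choice in this sketch: it would make the evaluation in step 4 easier, because every consolidated statement can be traced back to the notebook it came from.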

Requirements

Pointers

Contact

Further Information

About inpeek: We are a young IT technology company specializing in consulting and development services for SAP and software engineering solutions. Based on a strong technological foundation, we offer innovative, user-friendly, and practical solutions for different business units and sectors, and we do so with agility and passion.

What we are looking for: We think this project is best suited to master's students who have already gained some experience with GenAI solutions and are highly motivated to work independently on a project at the intersection of research and practice. From our perspective, the project is less about a polished software engineering setup (although we fully respect the requirements of the seminar) and more about developing a PoC that proves whether the approach can work and is scalable.

Commitment: inpeek is committed to providing the students with the necessary data and information, as well as answering questions regarding the requirements and the use case. However, we will not be able to offer day-to-day support on technical questions. We would appreciate it if a designated person from the SEG could assist the students with technical issues and serve as a single point of contact for the team. Of course, we will provide support to the best of our availability.

Availability of the contact person: Fabian Hüni is regularly available on Mondays and Tuesdays in Bern or online, and online on other days by arrangement.