Software Engineer - Enhancement and Migration of Data Pipelines
*Please note: the service contract for this position will not be concluded with Henkel AG & Co. KGaA but with an external party.
Enhancement and Migration of Data Pipelines
The project delivers the design and implementation of a new concept for data ingestion into Azure Data Lake Gen2. The concept must be in line with the existing concept for ingesting data into Henkel’s data warehouse. In particular, the curation of the data (naming conventions, key definitions) should follow the curation concept used in NEW BI.
The project is delivered as a SCRUM project using the same methodology and structure as other NEW BI projects. Henkel uses a lightweight SCRUM framework in which teams and organizations generate value through adaptive solutions for complex problems.
The aim of the project is to deliver data in a consistent, well-documented way to various use cases within the Henkel Data Foundation.
Background to the assignment
Due to capacity issues, Henkel does not have employees of its own with sufficient expertise in the following areas needed for this project:
- Azure DevOps, especially experience with CI/CD pipelines
- Azure Synapse Analytics (Spark cluster)
- Python (PySpark, pandas and Delta Lake)
- Common document models (JSON, XML)
- .NET Framework (especially C#)
Therefore, external expertise is required. The contractor holds a unique position and provides services that differ significantly from those of the internal staff.
Description of services
The services shall be provided within the framework of an agile development method. The concrete activities required in each case to implement the services commissioned shall be agreed iteratively between the parties within the framework of sprint meetings and implemented by the Contractor within the respective sprints following the sprint meetings. Prior to each sprint meeting, the contractor shall independently check, on the basis of its professional expertise, which individual services are reasonable and feasible within the scope of the assignment in the respective sprint.
The delivery of the project has to be organized by the external consultants, including planning the work packages (backlog), organizing meetings, planning deployments, and testing and documenting the developed code. The external consultants will form an independent SCRUM team. A dedicated SCRUM Master is not foreseen (the SCRUM team can pick one team member to partially take over this role).
A Product Owner orders the work for a complex problem into a Product Backlog. The backlog is maintained in Azure DevOps. The product backlog corresponds to the project plan, the roadmap for what the team plans to deliver. The backlog is created by adding user stories, backlog items, or requirements.
One sprint lasts two weeks, and there is a daily standup meeting. By the end of a sprint planning meeting, the team will have two items. The first is a sprint goal (a summary of the plan for the next sprint). The second is the sprint backlog (the list of backlog items the team will work on during the sprint). During the meetings, the team reviews its backlog and decides which items to prioritize for the next sprint. The contractor independently performs the following tasks:
- Analysing the current ingestion concept of Henkel’s Data Warehouse (DWH) as a first step before designing a new concept for ingestion of data into the data lake. Data ingestion extracts data from the source where it was created or originally stored and loads it into a destination or staging area.
- Reviewing the results of the Henkel Data Foundation review (covering Henkel’s data lake and data warehouse implementation, available as a Word document and a PowerPoint presentation) in order to design new concepts.
- Independently organizing workshops with Henkel employees to discuss the results of the review. Workshops can be conducted remotely or on-site, taking the applicable infection rules into account. Each workshop will include up to five Henkel employees and the SCRUM team.
- Independently reviewing and, if necessary, refining the project-related ingestion concept. The current ingestion concepts are described in the wiki documentation of Henkel’s NEW BI project. Access to Henkel DevOps (including wiki pages), Git repos and development environments is granted by Henkel in advance.
- Creating a concept for data ingestion and curation of data into Henkel’s Data Lake based on Azure Data Lake technology, taking into account modern DataOps concepts. The concept is to be included in the wiki documentation, which is subject to approval by Henkel. The aim is that the concept and the implementation are fully documented in the wiki pages.
- Aligning the proposed design and any changes to it with the Data Engineering CoE of Henkel (the Head of the Data Engineering CoE also acts as Product Owner for this specific project).
- Independently integrating the newly developed ingestion concept into the existing data landscape of Henkel’s Data Foundation, specifically considering the ingestion concept of Henkel’s DWH.
- Implementing the newly developed concept for ingestion and curation of data, leveraging Azure Synapse Analytics (Spark). Ingestion has aspects of both development and operations:
  - From a development perspective, ingest pipelines, or logical connections between a source and multiple destinations, need to be created. The resulting Curated Zone contains curated data stored in a data model that combines like data from a variety of sources.
  - From an operations perspective, the SCRUM team does not take over any operational tasks. The developed code and pipelines are handed over to the maintenance team. Corresponding handover sessions are organized as part of the sprint review.
- Independently performing unit testing of CI/CD (Continuous Integration / Continuous Delivery) pipelines during implementation. During development, the test criteria need to be coded so that results that are known to be good verify the unit's correctness. During test case execution, the framework logs tests that fail any criterion, and these failures need to be reported in a summary. The summary has to be reported as a result of the CI/CD pipeline.
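As an illustration of the unit-testing task above, the following minimal Python sketch shows one way known-good results could be coded as test criteria and failures collected into a summary. It is a sketch under stated assumptions, not part of Henkel's NEW BI concept: the naming rule and the names `curate_column_name`, `KnownGoodCase`, `run_cases` and `main` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def curate_column_name(raw: str) -> str:
    """Apply a hypothetical naming convention: trim, lower-case, underscores."""
    cleaned = raw.strip().lower()
    for separator in (" ", "-"):
        cleaned = cleaned.replace(separator, "_")
    return cleaned


@dataclass
class KnownGoodCase:
    """One test criterion: an input and the result known to be good."""
    name: str
    raw: str
    expected: str


def run_cases(unit: Callable[[str], str], cases: List[KnownGoodCase]) -> Dict:
    """Run every case against the unit and collect failures in a summary."""
    failures = []
    for case in cases:
        actual = unit(case.raw)
        if actual != case.expected:
            failures.append(
                f"{case.name}: expected {case.expected!r}, got {actual!r}"
            )
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failed": len(failures),
        "failures": failures,
    }


# Illustrative criteria only; real cases would come from the agreed concept.
CASES = [
    KnownGoodCase("spaces", " Customer ID ", "customer_id"),
    KnownGoodCase("hyphen", "Order-Date", "order_date"),
]


def main() -> int:
    """Print the summary and return an exit code for the pipeline step."""
    summary = run_cases(curate_column_name, CASES)
    print(f"{summary['passed']}/{summary['total']} cases passed")
    for failure in summary["failures"]:
        print("FAILED", failure)
    return 1 if summary["failed"] else 0
```

Calling `sys.exit(main())` from a script entry point would let a pipeline step fail on a non-zero exit code, since CI systems generally treat a non-zero exit code as a failed step. In practice an established framework such as `unittest` or `pytest` would likely replace the hand-written runner.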
The project does not have any concrete deadlines. Following the Agile Manifesto, the project plans small releases based on its progress.