While the hype around big data, IoT and Industry 4.0 is fairly recent, the core concepts, processes and practices are not new to most manufacturers. Most modern manufacturing facilities collect vast quantities of data each year. Historically, however, data has been collected for a single purpose, e.g. trending or quality checks. Much of this data is then archived and remains dormant, with no one questioning its remaining value. With all the advances in collecting and processing data, Cyzag set out to answer this question – what value is there in data that has become dormant?
In data science projects the various stages are intricately interlinked and depend heavily on domain expertise, both process knowledge and data science expertise.
Defining a good problem statement is key to a data science project and forms the starting point for bridging the gap between data science, the people that consume data and the results of data science projects. The problem statement is an active and dynamic part of the project; it is likely that it will be refined as the analysis work helps to gain a deeper understanding of the problem.
Following on from the problem statement, there are 4 main stages in a data science project:
The test case was conducted at a batch plant in conjunction with a global specialty chemicals manufacturer. The site in question has a particularly complex production process: it produces upwards of 50 specialty products, and each of these undergoes several stages of reactor processing and storage before delivery to an end customer.
The problem statement was straightforward – “our final stage reactors are constrained and we need to release capacity”. Through further discussion the problem statement was refined to show that the primary reaction step was the biggest culprit and that three products in particular should be the focus of the investigation. A number of assumptions were also introduced:
Just over one third of batches exceeded the ideal batch step time, which equated to over 400 hours per year of lost reactor time.
In the first instance the output (step duration) data was analysed, and it was immediately clear that the biggest contributor to step durations exceeding the ideal was product 2 on reactor 1.
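As an illustration of how such a breakdown can be made, a minimal sketch in Python is shown below. The table, column names and figures are entirely hypothetical stand-ins for the plant’s real data:

```python
import pandas as pd

# Hypothetical batch-level data; products, reactors and durations are illustrative only.
batch_data = pd.DataFrame({
    "product":     ["P1", "P2", "P2", "P3", "P2", "P1"],
    "reactor":     ["R1", "R1", "R1", "R2", "R2", "R2"],
    "step_hours":  [10.5, 14.2, 13.8, 9.9, 11.1, 10.2],
    "ideal_hours": [10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
})

# Excess time per batch = time spent beyond the ideal step duration (never negative).
batch_data["excess_hours"] = (batch_data["step_hours"] - batch_data["ideal_hours"]).clip(lower=0)

# Share of batches exceeding the ideal, and the total lost reactor time.
print("Share of batches over ideal:", (batch_data["excess_hours"] > 0).mean())
print("Total lost reactor hours:   ", batch_data["excess_hours"].sum())

# Which product/reactor combination contributes most to the lost time?
print(batch_data.groupby(["product", "reactor"])["excess_hours"].sum().sort_values(ascending=False))
```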
The initial problem refinement is an essential step in helping to isolate the problem. Problem refinement typically takes one or more iterations, especially when there is a feedback loop with users.
The data engineering phase is by far the most challenging aspect of nearly all data analytics projects. Data has to be extracted from multiple sources, issues with missing or incorrect data have to be resolved, and the data then needs to be structured ready for statistical and machine learning analysis.
In a manufacturing organisation, data generally comes from multiple sources – typically, but not limited to, the following:
It is often necessary to combine data from several of these sources in order to obtain the full set of relevant data for analysis.
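As a rough sketch of what this combination step can look like in practice (the source tables, columns and values below are hypothetical, not the site’s actual systems), batch records, lab results and historian time-series data might be joined on a batch identifier along these lines:

```python
import pandas as pd

# Hypothetical extracts from three typical sources; schemas and values are illustrative only.
batch_records = pd.DataFrame({"batch_id": [101, 102], "product": ["P2", "P1"], "reactor": ["R1", "R1"]})
lab_results   = pd.DataFrame({"batch_id": [101, 102], "purity_pct": [98.2, None]})
historian     = pd.DataFrame({"batch_id": [101, 101, 102], "tag": ["temp"] * 3, "value": [75.1, 76.3, 74.8]})

# Aggregate high-frequency historian data down to one value per batch and tag.
historian_agg = historian.groupby(["batch_id", "tag"])["value"].mean().unstack().reset_index()

# Join the sources on the batch identifier.
dataset = (batch_records
           .merge(lab_results, on="batch_id", how="left")
           .merge(historian_agg, on="batch_id", how="left"))

# Resolve obvious data quality issues, e.g. fill a missing lab value with the median.
dataset["purity_pct"] = dataset["purity_pct"].fillna(dataset["purity_pct"].median())
print(dataset)
```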
Many manufacturing plants have been collecting data for several years, and the collection and storage of that data was typically for a single purpose (e.g. trending or quality checks). Few manufacturers gave much forethought to how the data might be used with future analytics tools, and as a result several data quality issues are commonly seen:
In addition, statistical analysis of the manufacturing process has traditionally been done in a siloed way, where a single manufacturing step or production unit is analysed without looking at upstream manufacturing steps or raw material quality. At Cyzag we put the end-to-end manufacturing process at the core of our analytics philosophy; we believe that raw material quality and the state of the material during and after each manufacturing and storage step are equally relevant when solving a specific manufacturing problem. Structuring the data in the right way helps to solve problems where:
Overall there were multiple challenges in collecting, cleaning and structuring the data. The resulting dataset went through multiple iterations before it was deemed good enough for machine learning models and further analysis. The final dataset had upwards of 350 different variables – far too many for a human to analyse manually.
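To illustrate what structuring the data around the end-to-end process can mean in practice, the sketch below (with entirely hypothetical step names and measurements) pivots per-step records into one wide row per batch, so that raw material properties and every upstream reaction step sit alongside the final step duration:

```python
import pandas as pd

# Hypothetical per-step records covering the end-to-end process; names and values are illustrative.
step_data = pd.DataFrame({
    "batch_id":   [101, 101, 101, 102, 102, 102],
    "step":       ["raw_material", "reaction_1", "reaction_2"] * 2,
    "temp_c":     [21.0, 75.2, 80.1, 22.5, 76.0, 82.3],
    "duration_h": [0.0, 4.1, 10.5, 0.0, 4.3, 14.2],
})

# Pivot to one wide row per batch: every step's measurements become their own columns.
wide = step_data.pivot(index="batch_id", columns="step", values=["temp_c", "duration_h"])
wide.columns = [f"{step}_{measure}" for measure, step in wide.columns]

# The final-step duration is the output (y); everything upstream becomes a candidate feature (X).
y = wide["reaction_2_duration_h"]
X = wide.drop(columns=["reaction_2_duration_h"])
print(X.columns.tolist())
```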
In modern data science there is a plethora of statistical tools, algorithms and machine learning models available to make light work of complex and large datasets. Traditional statistical tools tend towards fitting data to a predefined model in order to get an approximation of the data. Machine learning models flip this concept: the model is fitted to the specific data, ultimately increasing the accuracy of the predictions and insights that can be drawn from it.
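The sketch below illustrates the difference on purely synthetic data: a traditional linear fit assumes a predefined form, while a flexible machine learning model (here a random forest, used only as an example) adapts to the data itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic, purely illustrative data with a non-linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

# Traditional approach: assume a predefined (linear) form and approximate the data with it.
linear = LinearRegression().fit(X, y)

# Machine learning approach: let a flexible model fit itself to the specific data.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print("Linear fit r-squared:", round(linear.score(X, y), 2))   # poor, the assumed form is wrong
print("Forest fit r-squared:", round(forest.score(X, y), 2))   # much closer to the data
```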
There are many types of machine learning algorithms, varying in complexity and in the problems they can solve. In general, however, the overall process that machine learning follows is similar regardless of the algorithm.
The machine learning process consists of the following main steps:
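As a minimal, generic sketch of that process (using scikit-learn and purely synthetic data – the workflow shown is a typical one, not necessarily the exact steps followed in the case study):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 1. Prepare the data (here a synthetic stand-in for the structured batch dataset).
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

# 2. Split into training data and previously unseen test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3. Train (fit) the model on the training data.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# 4. Evaluate the model on the unseen test data.
print("Test r-squared:", round(r2_score(y_test, model.predict(X_test)), 2))

# 5. Iterate: refine the features and model parameters, then repeat from step 2.
```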
For supervised machine learning models, there are key ways to analyse the performance of the models (i.e. how good they are at predicting outcomes on a previously unseen test data set):
For classification, models are scored using a simple accuracy metric:
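For reference, accuracy is simply the number of correct predictions divided by the total number of predictions made on the test set, usually expressed as a percentage.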
For regression, models are scored using one or more statistical methods which measure how close the predictions are to the real values. R-squared is one of the most popular; it gives an output between 0 and 1, and the closer to 1, the better the model is at predicting.
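The standard formulation is r-squared = 1 − (sum of squared differences between the predictions and the real values) / (sum of squared differences between the real values and their mean).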
Depending on the application, lower values of accuracy or r-squared are not necessarily a bad thing. Low r-squared values, for example, can still indicate strong relationships between specific parameters and the output.
Feature engineering is performed to boost the scores of the models. Typically this involves identifying which features have low impact scores. It can also involve creating new features based on the current data. Each iteration of feature engineering normally has a positive impact on the overall model performance. The process of feature engineering is also a fundamental part of identifying which features are the most significant.
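A minimal sketch of this kind of iteration is shown below, using scikit-learn on synthetic data; the feature names and the derived ratio feature are hypothetical examples rather than the variables used in the case study:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the structured batch dataset; feature names are illustrative only.
X, y = make_regression(n_samples=400, n_features=10, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impact; low-impact features are candidates for removal.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head())

# Example of creating a new feature from the existing data (a purely hypothetical ratio).
X["feature_0_per_feature_1"] = X["feature_0"] / (X["feature_1"].abs() + 1e-6)
```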
The results of the case study were promising. The trained models achieved an accuracy of 62% and an r-squared of 0.45, and the significant features could be narrowed down to 8 main influencing variables:
Because the r-squared and accuracy scores were not higher, it was necessary to visually inspect the output variable against each of the significant features to confirm the results.
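Such a visual check can be as simple as a scatter plot of the output against each flagged feature; the sketch below uses purely made-up data for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Purely illustrative data: one flagged feature plotted against the step duration output.
rng = np.random.default_rng(1)
feature = rng.uniform(0, 1, 200)
step_duration = 10 + 4 * feature + rng.normal(0, 1.5, 200)

# A simple scatter plot makes it easy to confirm (or reject) the relationship flagged by the model.
plt.scatter(feature, step_duration, alpha=0.6)
plt.xlabel("Significant feature (illustrative)")
plt.ylabel("Step duration (h)")
plt.title("Visual check of a model-flagged relationship")
plt.show()
```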
Following on from the machine learning results, it was clear that there were significant correlations between the raw materials identified as significant and the step duration. By working together with suppliers to tighten the specifications of these raw materials, the plant can increase overall process control and predictability.
In the first reaction step, one significant feature stood out, labelled “Reaction 1 Feature A”. The higher the value of this feature, the lower the variation in the step duration of the second reaction step.
In the second reaction step, two significant features stood out. The first is labelled “Reaction 2 Feature 1”; higher values of this feature correlated directly with increased variation in the step duration.
The second significant feature was the reaction time of the second stage reaction. When combined with the ambient temperature and the “Reaction 2 Feature 1” feature (see Figure 8 below), the results stand out as extremely significant. For higher values of “Reaction 2 Feature 1”, the process is highly influenced by the ambient temperature. It follows that when temperatures are known to be higher, the plant should plan for a lower value of “Reaction 2 Feature 1” (a feature that is easily controlled by the plant).
There were several very convincing results to come out of the case study. Some of the outcomes would require capital expenditure to rectify, others would require communication with the supplier about adjusting specification parameters of the input raw materials, but in one case it was possible to improve the process simply by adjusting one of the process parameters.
Overall, this project resulted in a 32% time saving in the constrained production unit.
In this case study, Cyzag set out to prove that by using advanced data analytics tools – in particular by structuring data in a way that represents the reality of the manufacturing process – there is significant value to be found in dormant data.