What happens when a data scientist finds an error in the data? For example, what happens if you import a data set and one of the Excel tabs is missing? What if the units of measurement are incorrect, or mismatched between sites? What if the time stamp on the data is old, indicating the data is stale? Or the data is simply bad? These are questions data scientists have struggled with while building advanced analytics and models.
The answer helps explain one of the most inefficient processes in data science: dealing with insufficient or bad data. A data scientist spends most of the day working with data and dealing with the consequences of bad data. Harvard Business Review calls the $3.1 trillion spent each year on poor data quality the “hidden data factory,” representing the “lengthy and expensive process of manually checking over poor data and making corrections to that data.”
OSIsoft PI, used by millions of users globally, is an example of a system of record that only works when there is “100 percent confidence in the data quality”; even 4% missing, stale, or bad PI data can cause 30% of models to fail. For this reason, a method of cleaning the data before it is used is integral to any business model built on PI. PI data owners tell us, “I need to fix bad data before my analysts use it in their models…Today, it can be weeks before poor data is identified, causing performance value models to fail.” The PI System is a suite of software products used for collecting, historizing, finding, analyzing, delivering, and visualizing data. Given how heavily companies rely on the PI System to evaluate their operations, even one piece of bad data, if mishandled, can have catastrophic consequences. Returning to the initial question of what happens when a data scientist finds an error in the data: few have been able to solve this problem. APERIO, however, has created a solution…at scale.
APERIO has succeeded where others have failed. It proactively fixes PI data problems at scale, automates anomaly detection, and supports interactive root cause analysis. What sets APERIO apart is its ability to reliably produce accurate PI data by applying rigorous machine learning across millions of live data streams (more than 2 million tags, for example). This machine learning technology measures data quality along six dimensions of all operational data: accuracy, consistency, completeness, validity, integrity, and timeliness. It can alert at the highest level of the asset framework, with insight and recommended actions to resolve poor PI data issues.
Not only do companies spend tremendous time and resources cleaning their data, but their solutions are not nearly as precise as APERIO’s. A one-size-fits-all solution is hard to find because each data set is different: linear interpolation only works for some correlations, for example, and maximum likelihood estimation can be biased with small sample sizes. Because APERIO DataWise relies on unsupervised, automated machine learning, its handling of missing data grows more and more accurate, eliminating the need to hand-tailor a method to each data set. With quantifiable data quality and trending, APERIO DataWise for PI can contextualize alerts and integrate with PI notifications.
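The interpolation pitfall above can be sketched in a few lines of Python. This is a generic illustration of the conventional technique, not APERIO’s algorithm, and `linear_fill` is a hypothetical helper name:

```python
def linear_fill(values):
    """Fill None gaps by drawing a straight line between the nearest
    known neighbors. The key assumption: the signal moves linearly
    across the gap, which holds for some sensor trends but not others."""
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((p for p in known if p[0] < i), default=None)
        right = min((p for p in known if p[0] > i), default=None)
        if left and right:
            (i0, v0), (i1, v1) = left, right
            filled[i] = v0 + (v1 - v0) * (i - i0) / (i1 - i0)
        elif left:   # gap at the end: carry the last reading forward
            filled[i] = left[1]
        elif right:  # gap at the start: carry the first reading backward
            filled[i] = right[1]
    return filled

# Reasonable for a steadily rising trend:
linear_fill([1.0, None, 3.0])  # -> [1.0, 2.0, 3.0]

# Misleading for a curved signal: if the true process peaked between
# the two samples, the straight line misses the peak entirely.
linear_fill([0.0, None, 0.0])  # -> [0.0, 0.0, 0.0]
```

The second call shows why the technique cannot be applied blindly: the filled value is plausible but wrong whenever the underlying process is nonlinear across the gap.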
If the quality of your PI data is putting your value-added initiatives (analytics, APC, process optimization, predictive models, AI, etc.) at risk, learn how you can easily deploy APERIO DataWise for PI to monitor, track, and instantly improve the quality of your PI data.
Jacob Albert is a sophomore at the University of Michigan pursuing a bachelor’s degree in Finance at the Stephen M. Ross School of Business. As a marketing intern with APERIO in 2022, he researched data-driven sustainability and the risks of converging IT and OT, and how APERIO’s machine learning algorithms can turn those insights into predictability and profit.