Challenge Overview
Challenge Objectives
In this challenge you will develop a model that detects anomalies in Supervisory Control and Data Acquisition (SCADA) data using unsupervised machine learning.
Project Background
The project focuses on improving the overall productivity of the Wind Turbine Generators (WTG) by evaluating the condition of the internal components, anomalous behaviour, and risk of failure. This challenge is part of a larger data science project, WTG Predictive Asset Management (PAM), which Topcoder is proud to be a part of. By taking part in this competition you not only get a chance to work on an important real-world problem and win good prizes, but you also contribute to the future well-being of our planet.
The customer plans to improve overall productivity by identifying anomalous behavior and then taking corrective actions, thereby avoiding failures that lead to downtime. The idea is to identify components that tend to show anomalous behavior due to factors such as deteriorating health, load, and wear and tear.
Anomaly is defined as any deviation from the reference. It is important to analyze the anomalous behavior and the root cause for the anomaly. Once the root cause is known then corrective actions can be carried out to avoid failures.
The project aims to capture anomalies at the component, subcomponent, system, and turbine (WTG) level for a given turbine (see the output format).
Technology Stack
Python 3.6.x
Individual Requirements
Scope
The focus of this challenge is to explore the ability of unsupervised machine learning to detect anomalous behavior of WTGs at turbine, system, subsystem, component, and subcomponent level. There are two variable-selection problems to solve here:
- Univariate input, such as the power of the turbine, to detect the anomaly.
- Multivariate inputs, such as different SCADA tags, to detect the anomaly.
- For example, if at a particular wind speed and a certain rotor rpm the turbine produces a temperature beyond the normal values, this can be considered an anomaly.
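The multivariate example above can be sketched as a conditional baseline: readings are grouped by wind-speed bin and a temperature is flagged when it is unusual for its own bin. This is only an illustration under stated assumptions, not a required approach. The function name `flag_by_wind_bin`, the 1 m/s bin width, the median/MAD statistics and the 3.5 threshold are all assumptions, and rotor rpm is left out to keep the sketch short.

```python
from collections import defaultdict
from statistics import median


def flag_by_wind_bin(records, bin_width=1.0, threshold=3.5):
    """Flag temperatures that are unusual for their wind-speed bin.

    records: list of (wind_speed, temperature) tuples (hypothetical layout).
    Returns a list of booleans aligned with records.
    """
    # Group temperatures by wind-speed bin.
    bins = defaultdict(list)
    for wind, temp in records:
        bins[int(wind // bin_width)].append(temp)

    # Robust centre and spread per bin: median and median absolute deviation.
    stats = {}
    for key, temps in bins.items():
        med = median(temps)
        mad = median(abs(t - med) for t in temps)
        stats[key] = (med, mad)

    flags = []
    for wind, temp in records:
        med, mad = stats[int(wind // bin_width)]
        if mad == 0:
            flags.append(False)  # no spread in this bin, so nothing to judge against
        else:
            score = 0.6745 * abs(temp - med) / mad  # modified z-score
            flags.append(score > threshold)
    return flags


# Made-up readings: 70 degrees is normal near 12 m/s, but 95 is not normal near 8 m/s.
records = [(8.2, 60), (8.5, 61), (8.1, 59), (8.7, 60), (8.3, 62), (8.9, 61), (8.4, 95),
           (12.1, 70), (12.4, 71), (12.2, 69), (12.8, 70), (12.5, 72)]
flags = flag_by_wind_bin(records)
```

Because each bin has its own baseline, the same temperature can be normal at one wind speed and anomalous at another, which is the point of the multivariate formulation.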
While detecting an anomaly the model should also provide a confidence score.
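One possible way to attach a confidence score is sketched below for the univariate case. The challenge does not define how confidence should be computed, so the logistic mapping used here is a heuristic assumption, not a calibrated probability. The function name `anomaly_confidence` and the 3.5 threshold are hypothetical, and the power values are made up.

```python
import math
from statistics import median


def anomaly_confidence(values, threshold=3.5):
    """Return (is_anomaly, confidence) for each value in a univariate series."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    results = []
    for v in values:
        if mad == 0:
            results.append((False, 0.0))  # constant series: nothing stands out
            continue
        score = 0.6745 * abs(v - med) / mad  # modified z-score
        # Heuristic: logistic curve centred on the threshold, so a score
        # exactly at the threshold maps to a confidence of 0.5.
        confidence = 1.0 / (1.0 + math.exp(-(score - threshold)))
        results.append((score > threshold, round(confidence, 3)))
    return results


# Made-up active power readings with one obvious drop.
power = [1500, 1510, 1495, 1505, 1490, 1500, 300]
results = anomaly_confidence(power)
```

Any monotone mapping from anomaly score to the range 0 to 1 would serve the same purpose; the choice should be documented in the submission.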
Data Analysis
You need to perform data analysis work on the given dataset and save the notebook which should be shared along with the submission. It should have the following items covered properly:
- Feature importance and selection procedures, preferably illustrated with graphs
- If you tried other methods before settling on an approach, you can keep that work in the notebook for reference
- Code should be documented appropriately (within the code): Explanations are needed on how the different areas of the model work.
Dataset
The following inputs are provided in the forum:
- SCADA dataset for 5 years
- Data Dictionary
- Asset Hierarchy and mapping of tags with components and subcomponents which can be used to derive whether the anomaly is detected at turbine, system, subsystem, component or subcomponent level.
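As a rough sketch of how the tag mapping might be used to derive the anomaly level: take the hierarchy paths of all tags implicated in an anomaly and report the deepest level they share. The tag names and hierarchy entries below are invented for illustration; the real ones come from the Asset Hierarchy file, and the function `common_level` is hypothetical.

```python
LEVELS = ("System", "Subsystem", "Component", "Subcomponent")

# Hypothetical mapping: tag -> (system, subsystem, component, subcomponent).
TAG_HIERARCHY = {
    "GearOilTemp": ("Drive Train", "Gearbox", "Lubrication", "Oil Sump"),
    "GearBearTemp": ("Drive Train", "Gearbox", "Bearings", "High Speed Bearing"),
    "GenBearTemp": ("Generator", "Generator Assembly", "Bearings", "Drive End Bearing"),
}


def common_level(tags, hierarchy=TAG_HIERARCHY):
    """Return (level_name, shared_path) for the tags involved in an anomaly."""
    paths = [hierarchy[tag] for tag in tags]
    shared = []
    # Walk the hierarchy top-down and stop at the first level where tags disagree.
    for parts in zip(*paths):
        if len(set(parts)) != 1:
            break
        shared.append(parts[0])
    # If the tags do not even share a system, report at turbine level.
    level = LEVELS[len(shared) - 1] if shared else "Turbine"
    return level, tuple(shared)


single = common_level(["GearOilTemp"])
same_gearbox = common_level(["GearOilTemp", "GearBearTemp"])
cross_system = common_level(["GearOilTemp", "GenBearTemp"])
```

Under these assumptions a single tag resolves to its subcomponent, two gearbox tags resolve to the subsystem, and tags from different systems resolve to the turbine.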
Prediction Format
You must submit a CSV file that contains the following details, for the given dataset.
Timestamp, Turbine, System, Subsystem, Component, Subcomponent, Confidence, Scada Tags
A template is provided in the forum which can be used as a reference for output format.
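A minimal sketch of writing the prediction file with the column order listed above, using Python's standard `csv` module. The forum template remains the authority on the exact format. The timestamp format, the turbine identifier, the tag names, and the choice of `;` to join several SCADA tags in one cell are assumptions.

```python
import csv
import io

COLUMNS = ["Timestamp", "Turbine", "System", "Subsystem",
           "Component", "Subcomponent", "Confidence", "Scada Tags"]


def write_predictions(rows, fileobj):
    """Write anomaly rows (dicts keyed by COLUMNS) as CSV to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        out = dict(row)
        # Assumption: several tags are stored in one cell, separated by ';'.
        out["Scada Tags"] = ";".join(row["Scada Tags"])
        writer.writerow(out)


# Made-up example row; levels that do not apply can be omitted and stay empty.
rows = [{
    "Timestamp": "2020-01-01 00:10:00",
    "Turbine": "WTG01",
    "System": "Drive Train",
    "Subsystem": "Gearbox",
    "Component": "Lubrication",
    "Subcomponent": "Oil Sump",
    "Confidence": 0.97,
    "Scada Tags": ["GearOilTemp", "RotorRPM"],
}]
buffer = io.StringIO()
write_predictions(rows, buffer)
lines = buffer.getvalue().splitlines()
```

When writing to a real file instead of `io.StringIO`, open it with `newline=""` as the `csv` documentation recommends.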
Evaluation
Since we are using unsupervised learning, there won’t be any objective evaluation. The client will select the best model for this use case based on their internal evaluation.
Deployment Guide
Make sure you provide a README.md that covers how to run the script in any environment.
Final Submission Guidelines
Submit the following:
- Data Analysis Code Notebook
- Model as Python script
- Documentation
Your submission should include a text, .doc, PPT or PDF document that includes the following sections and descriptions:
- Overview: describe your approach in “layman's terms”
- Methods: describe what you did to come up with this approach, e.g. literature search, experimental testing, etc. If you augmented any of the ideas provided as input, describe your innovations.
- Materials: did your approach use a specific technology beyond Jupyter? Any libraries? List all tools and libraries you used.
- Discussion: Include your analysis in this section. Explain what you attempted, considered or reviewed that worked, and especially what didn’t work or what you rejected. For anything that didn’t work or was rejected, briefly explain the reasons (e.g. such-and-such needs more data than we have). If you are pointing to somebody else’s work (e.g. you’re citing a well-known implementation or literature), describe in detail how that work relates to this work, and what would have to be modified.
- Data: What other data should one consider? Is it derived? Is it necessary in order to achieve the aims? Also, what about the data described/provided - is it enough?
- Assumptions and Risks: What are the main risks of this approach, and what are the assumptions you/the model is/are making? What are the pitfalls of the dataset and approach?
- Results: Did you implement your approach? How’d it perform? Provide some suggested approaches to evaluate your results.
- Other: Discuss any other issues or attributes that don’t fit neatly above that you’d also like to include