The Data
We used data from NASA's Kepler space telescope, which was designed to discover Earth-like planets
orbiting other stars. The dataset consists of labeled time-series data: over 3,000 light
flux values for each host star. These light flux values represent the amount of light
received from a star over time, and significant dips in these values can indicate the presence of an
exoplanet as it passes in front of the star, temporarily blocking some of its light.
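To make the idea of a transit dip concrete, here is a minimal sketch that builds a synthetic light curve and measures how much the brightness drops while a planet is in front of the star. All of the numbers (period, duration, depth, noise level) and the function name `synthetic_light_curve` are illustrative assumptions, not values taken from the Kepler data.

```python
import numpy as np

def synthetic_light_curve(n_points=3000, period=500, duration=20,
                          depth=0.01, noise=0.001, seed=0):
    """Return a flat, noisy flux series with periodic box-shaped dips.

    Every parameter here is a made-up illustration value.
    """
    rng = np.random.default_rng(seed)
    # Normalized baseline brightness of 1.0 plus a little measurement noise.
    flux = 1.0 + rng.normal(0.0, noise, n_points)
    t = np.arange(n_points)
    # True for the samples where the planet is in front of the star.
    in_transit = (t % period) < duration
    # The planet blocks a fraction `depth` of the star's light.
    flux[in_transit] -= depth
    return flux, in_transit

flux, in_transit = synthetic_light_curve()
# Compare the average brightness outside and inside the transits.
measured_depth = flux[~in_transit].mean() - flux[in_transit].mean()
print(f"measured transit depth: {measured_depth:.4f}")
```

The measured depth should come out close to the `depth` that was put in. In real data the dips are usually much harder to separate from noise, which is part of why a learned model is useful.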
Each star in our dataset was given a label indicating whether it had at least one exoplanet in its
orbit. This labeling allowed us to train our machine learning model to distinguish between systems
with and without exoplanets based on light flux patterns.
The Process
Before building or training our model, we needed to make sure our data was in good condition. This involved several preprocessing steps to address potential issues:
- 1. Handling Missing Values:
We checked for any NaN (Not a Number) values in the dataset, as these could disrupt the training process. Fortunately, there were none, thanks to the high quality of NASA's data.
- 2. Balancing the Dataset:
One significant challenge was the imbalance in our data. Exoplanets are rare, so the number of stars with detected exoplanets was far lower than the number without. To prevent the model from becoming biased toward the majority class (stars without exoplanets), we used a balancing tool that applies techniques such as oversampling the minority class, so that both target classes carry equal weight in the training process.
- 3. Data Scaling:
Scaling the data was another essential step. Because the range of light flux values can vary widely, we normalized them so that the model could learn effectively from the data without being skewed by differences in magnitude.
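The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only NumPy, not the exact tooling we used: the helper names (`has_missing`, `oversample_minority`, `standardize_rows`) are hypothetical, the oversampling is plain random duplication of minority rows, the scaling is a per-star z-score (one of several reasonable normalizations), and the toy array stands in for the real flux table.

```python
import numpy as np

def has_missing(X):
    """Return True if any flux value is NaN."""
    return bool(np.isnan(X).any())

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes match in size.

    Assumes a binary label vector y containing 0s and 1s.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Draw minority rows with replacement to fill the gap.
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def standardize_rows(X):
    """Scale each star's series to zero mean and unit variance."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero for a constant series
    return (X - mean) / std

# Toy stand-in data: 10 stars, 50 flux values each, 2 labeled as planet hosts.
rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=25.0, size=(10, 50))
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

print("missing values:", has_missing(X))
X_bal, y_bal = oversample_minority(X, y)
X_scaled = standardize_rows(X_bal)
print("class counts after balancing:", np.bincount(y_bal))
```

One design point worth noting: oversampling is normally applied to the training portion only, after the train/test split, so that duplicated rows cannot appear in both sets and inflate the test results.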
With our data preprocessed, we were ready to train our machine learning model. The training process
involved feeding the model the labeled time-series data and allowing it to learn patterns associated
with the presence of exoplanets. We used a portion of the data for training and reserved a separate
portion as a test set to evaluate the model's performance after we finished training.
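As a rough sketch of how such a split might look, the snippet below holds out a fixed fraction of each class so that the rare planet-hosting stars appear in both portions. The function name `stratified_split`, the 20% test fraction, and the toy data are assumptions for illustration; the actual split we used may differ.

```python
import numpy as np

def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split X and y into train and test sets, preserving class proportions."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for label in np.unique(y):
        # Shuffle the rows of this class and reserve a share for testing.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = max(1, int(round(test_fraction * len(idx))))
        test_idx.extend(idx[:n_test])
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[test_idx] = True
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]

# Toy stand-in data: 100 stars, 30 flux values each, 10 labeled as planet hosts.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30))
y = np.array([1] * 10 + [0] * 90)

X_train, X_test, y_train, y_test = stratified_split(X, y)
print(len(y_train), len(y_test), int(y_test.sum()))
```

Splitting per class matters here because a purely random split of a heavily imbalanced dataset can leave the test set with very few positive examples, or none at all.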
Finally, we tested our trained model on the test set, which it had not seen during training. This step
allowed us to check that our model generalizes to new, unseen data. We were happy to see
that the model performed very well, accurately predicting the presence of exoplanets on the test set.
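When judging test results on data this imbalanced, it helps to look beyond plain accuracy. The sketch below computes accuracy, precision, and recall from scratch; the helper name `binary_metrics` and the label arrays are hypothetical and are not our model's actual predictions. The example deliberately uses a "model" that always answers "no planet" to show how accuracy alone can mislead.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # planets correctly found
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # planets missed
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # correct rejections
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical test set: 2 planet hosts among 100 stars.
y_true = np.array([1, 1] + [0] * 98)
# A trivial baseline that predicts "no planet" for every star.
y_pred = np.zeros(100, dtype=int)

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The always-negative baseline scores high on accuracy while finding none of the planets, so recall on the planet class is the number to watch when comparing models for this task.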
The Impact
Our trained model can now predict whether a star system contains an exoplanet based on over 3,000 light
flux values. This advancement is particularly exciting for the teams at NASA, as it provides a powerful
tool to quickly scan through millions of star systems for potential exoplanets. The ability to automate
this initial screening process can save researchers countless hours and help them focus their efforts on the
most promising candidates for further study and analysis.
By using the vast amounts of data collected by the Kepler telescope over the years and applying advanced machine
learning techniques, we are contributing to the ongoing search for exoplanets and the broader
understanding of our universe. It shows that with a little help from machine learning, we can approach problems from a new perspective and perhaps broaden our knowledge in the process.