The Data: Harnessing Kepler's Discoveries
Our project uses public data collected by NASA's Kepler space telescope, a spacecraft designed to discover Earth-like planets orbiting distant stars.
The core dataset we worked with consisted of labeled time-series data related to thousands of stars. Specifically:
Data Type: Time-series data of light flux values.
Data Volume: Over 3,000 light flux values were recorded per star system over time.
Significance: Light flux is the amount of light received from a star. A periodic, significant dip in these values is the telltale sign of an exoplanet passing in front of the star. Detecting planets from such dips is known as the transit method.
Each star in our dataset carried a label indicating whether it had at least one confirmed exoplanet orbiting it. These labels served as ground truth, allowing us to train our machine learning model to distinguish between systems with and without exoplanets based purely on their light flux patterns.
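To make the transit signal concrete, here is a minimal sketch that simulates a light curve and counts the points that fall well below the baseline. This is synthetic data, not Kepler data, and every number in it (period, transit duration, dip depth, noise level, the 5-sigma cutoff) is an assumption chosen for illustration. The function names are hypothetical as well.

```python
import random
import statistics


def simulate_flux(n_points=3000, period=400, duration=12, depth=0.01,
                  noise=0.001, has_planet=True, seed=0):
    """Return a synthetic light curve with a baseline flux of 1.0."""
    rng = random.Random(seed)
    flux = []
    for t in range(n_points):
        value = 1.0 + rng.gauss(0.0, noise)
        # During a transit the planet blocks a small fraction of the starlight.
        if has_planet and (t % period) < duration:
            value -= depth
        flux.append(value)
    return flux


def count_dip_points(flux, n_sigma=5.0):
    """Count samples that fall far below the median flux."""
    median = statistics.median(flux)
    # Median absolute deviation: a spread estimate that the dips barely affect.
    mad = statistics.median(abs(v - median) for v in flux)
    threshold = median - n_sigma * 1.4826 * mad
    return sum(1 for v in flux if v < threshold)


with_planet = simulate_flux(has_planet=True, seed=1)
without_planet = simulate_flux(has_planet=False, seed=2)
dips_with = count_dip_points(with_planet)
dips_without = count_dip_points(without_planet)
print("points in dips (planet):", dips_with)
print("points in dips (no planet):", dips_without)
```

A fixed threshold like this only works on clean, simulated curves. Real light curves contain stellar variability and instrument noise, which is why we trained a model on labeled examples instead.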

Figure: Illustration of exoplanets
The Process: Preprocessing and Model Training
Before we could even begin to train our machine learning model, we had to ensure our astronomical data was in pristine condition. This involved several standard and essential preprocessing steps:
Data Preprocessing Steps
1. Handling Missing Values
Goal: Identify and mitigate any data corruption.
Outcome: We checked for any NaN (Not a Number) values. Fortunately, due to the high-quality and rigorous accuracy standards of NASA's data collection, none were found.
2. Balancing the Dataset
Goal: Prevent model bias towards the majority class.
Challenge: Exoplanets are rare; the number of stars without confirmed exoplanets was significantly higher than those with them (a massive class imbalance).
Solution: We used a data balancing tool that implemented techniques like oversampling the minority class (stars with exoplanets). This ensured both target classes were given equal weight during the training process, leading to a more robust model.
3. Data Scaling
Goal: Ensure light flux magnitudes don't skew the model.
Challenge: The range of raw light flux values can vary widely, potentially confusing the learning algorithm.
Solution: We normalized these values, scaling all input features to a small, consistent range (like 0 to 1). This is essential for gradient descent-based algorithms to learn effectively and prevents features with larger numerical magnitudes from disproportionately influencing the model.
Model Training and Evaluation
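The three preprocessing steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the tool we used: the helper names are hypothetical, the oversampling is simple random duplication of minority rows, and the scaling is min-max applied per star (scaling per feature column is an equally reasonable reading of the step). The toy rows are made up.

```python
import math
import random


def count_missing(rows):
    """Count NaN entries across all light curves."""
    return sum(1 for row in rows for v in row if math.isnan(v))


def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def min_max_scale(row):
    """Scale one light curve to the range [0, 1]."""
    lo, hi = min(row), max(row)
    if hi == lo:  # flat curve: avoid division by zero
        return [0.0 for _ in row]
    return [(v - lo) / (hi - lo) for v in row]


# Toy data: 5 stars without planets (label 0) and 1 with a planet (label 1).
rows = [[100.0, 101.0, 99.0, 100.5]] * 5 + [[100.0, 90.0, 100.0, 90.5]]
labels = [0, 0, 0, 0, 0, 1]

print("missing values:", count_missing(rows))
balanced_rows, balanced_labels = oversample_minority(rows, labels)
scaled_rows = [min_max_scale(row) for row in balanced_rows]
print("class counts after balancing:",
      balanced_labels.count(0), balanced_labels.count(1))
```

One design note: when oversampling by duplication, it is generally safer to balance only the training portion of the data, so that copies of the same star cannot end up in both the training and test sets.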
With our data successfully preprocessed, we proceeded to train our model.
Training: The model was fed the clean, labeled time-series data, allowing it to learn the subtle light flux patterns associated with exoplanet transits.
Split: We used a portion of the data for training and reserved a separate, unseen portion as a test set to evaluate generalization.
Testing and Validation: Finally, we evaluated the trained model on the reserved test set. It performed very well, correctly predicting the presence of exoplanets in star systems it had never encountered during training, which indicates that it generalizes to new, unseen astronomical observations.
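The split, train, and test workflow above can be sketched end to end. The model here is a stand-in: a small logistic regression trained with batch gradient descent, picked only because it fits in a few lines, and not necessarily the model we trained. The data is synthetic, and the 80/20 split, learning rate, and epoch count are all assumptions.

```python
import math
import random


def make_star(has_planet, rng, n_points=40):
    """Synthetic light curve, min-max scaled to [0, 1]. Not Kepler data."""
    phase = rng.randrange(20)
    flux = []
    for t in range(n_points):
        value = 1.0 + rng.gauss(0.0, 0.001)
        if has_planet and (t + phase) % 20 < 3:
            value -= 0.01  # transit dip
        flux.append(value)
    lo, hi = min(flux), max(flux)
    return [(v - lo) / (hi - lo) for v in flux]


def predict_proba(weights, bias, x):
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, X, y):
    eps = 1e-12
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def train(X, y, epochs=300, lr=0.05):
    """Batch gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias


rng = random.Random(42)
labels = [i % 2 for i in range(120)]  # balanced toy labels
stars = [make_star(bool(label), rng) for label in labels]

# Shuffle indices, then hold out 20% as an unseen test set.
order = list(range(len(stars)))
rng.shuffle(order)
split = int(0.8 * len(order))
train_idx, test_idx = order[:split], order[split:]
X_train = [stars[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [stars[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

initial_loss = log_loss([0.0] * len(X_train[0]), 0.0, X_train, y_train)
weights, bias = train(X_train, y_train)
final_loss = log_loss(weights, bias, X_train, y_train)

predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in X_test]
tp = sum(1 for p, t in zip(predictions, y_test) if p == 1 and t == 1)
fp = sum(1 for p, t in zip(predictions, y_test) if p == 1 and t == 0)
fn = sum(1 for p, t in zip(predictions, y_test) if p == 0 and t == 1)
accuracy = sum(1 for p, t in zip(predictions, y_test) if p == t) / len(y_test)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print("train loss:", initial_loss, "->", final_loss)
print("test accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

The sketch reports precision and recall alongside accuracy because, when exoplanet hosts are rare in the test set, accuracy alone can look high even for a model that misses most of them.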
The Impact: Accelerating Exoplanet Discovery
Our trained machine learning model can now predict whether a star system contains an exoplanet by analyzing the more than 3,000 light flux values recorded from the host star.
This achievement provides a powerful new tool for teams at NASA and across the astrophysics community:
Automation and Efficiency: Automating this initial screening can save researchers countless hours of manual review and makes it feasible to scan millions of star systems rapidly for potential candidates.
Focused Research: By quickly filtering out low-probability systems, the model helps researchers focus their limited time and resources on only the most promising candidates for labor-intensive follow-up study and confirmation using other powerful telescopes.
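As a rough sketch of how such screening might look in practice, the snippet below keeps only the systems whose predicted probability clears a threshold and lists the most likely candidates first. The star names, scores, and the 0.5 threshold are all made up for illustration, and `shortlist` is a hypothetical helper.

```python
def shortlist(probabilities, threshold=0.5, top_k=None):
    """Return (star_id, probability) pairs at or above threshold, most likely first."""
    kept = [(sid, p) for sid, p in probabilities.items() if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Hypothetical model outputs: probability that each star hosts an exoplanet.
scores = {"star_a": 0.02, "star_b": 0.91, "star_c": 0.47, "star_d": 0.66}
candidates = shortlist(scores, threshold=0.5)
print(candidates)  # [('star_b', 0.91), ('star_d', 0.66)]
```

Where the threshold sits is a trade-off: lowering it sends more systems to follow-up but misses fewer real planets.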
By applying machine learning to the vast amounts of data collected by the Kepler telescope over its operational years, we are contributing to the ongoing search for exoplanets and to humanity's broader understanding of the universe. The project demonstrates how fresh perspectives, coupled with powerful AI, can dramatically accelerate scientific discovery.