First, you need to choose an interesting problem or task to focus your capstone project on. Some good options to consider include:
Predicting house prices based on features like size, number of bedrooms, location, etc. Housing price data is widely available online and is a classic regression problem.
Classifying images into categories like dogs vs cats. Image datasets like CIFAR-10 are standard benchmarks for computer vision models.
Analyzing sentiment in customer reviews or tweets to predict if a comment is positive, negative or neutral. Large review datasets are available to work with.
Using NLP techniques to automatically summarize articles or other text data. Summarization is a challenging open problem in the field.
Building recommender systems to suggest products, movies, music or other items to users based on their preferences and behavior. This has many real-world applications.
Developing chatbots using techniques like sequence-to-sequence learning for natural language conversations. Datasets of conversations can be found online.
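To make the first idea concrete, here is a minimal sketch of a house-price regression using scikit-learn. The data is synthetic and the feature names (size, bedrooms) and coefficients are purely illustrative, but the shape of the workflow matches what a real project would look like:

```python
# Minimal house-price regression sketch using scikit-learn.
# The dataset here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
size = rng.uniform(50, 300, n)        # living area (illustrative units)
bedrooms = rng.integers(1, 6, n)      # bedroom count
# Assumed linear price relationship plus noise, for demonstration only.
price = 1500 * size + 20000 * bedrooms + rng.normal(0, 20000, n)

X = np.column_stack([size, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```

With a real housing dataset you would swap the synthetic arrays for loaded data and add the location and other features mentioned above.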
Once you’ve identified a problem area, perform background research to understand the domain, any related work that has been done, available data sources and appropriate ML techniques to apply. Spending time on research up front will help focus your efforts.
Some key sources for research include: influential machine learning papers on arXiv and at conferences like ICML, NeurIPS (formerly NIPS) and ICLR, academic publications in your specific problem domain, and datasets available on sites like Kaggle. Go through Kaggle notebooks (formerly kernels) and talk to professors for additional guidance.

With the problem defined and research completed, the next step is collecting or generating any data required to train and evaluate models. For some tasks like image classification, ready-to-use datasets exist. But other problems may require collecting your own raw data and preprocessing it into a usable format for ML algorithms.
Ensuring you have a suitably large, diverse, cleaned and preprocessed dataset is essential for model development. As a rough rule of thumb, 15,000 to 30,000 datapoints is a minimum for many deep learning problems, though simpler models can often work with far less. Synthetic data generation is another option when real data is limited.
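When real data is scarce, scikit-learn can generate a synthetic dataset for prototyping. A minimal sketch (all parameter values here are arbitrary choices for illustration):

```python
# Generate a synthetic binary-classification dataset with scikit-learn,
# useful for prototyping pipelines before real data is available.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=20000,    # in the rough size range discussed above
    n_features=20,      # total features
    n_informative=10,   # features actually correlated with the label
    n_classes=2,
    random_state=42,    # fixed seed for reproducibility
)
print(X.shape, y.shape)
```

The same pipeline code can then be pointed at the real dataset once it has been collected and preprocessed.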
After data collection and preprocessing, you’re ready to start implementing machine learning solutions. Some key stages include:
Exploratory data analysis – Understand patterns, outliers, missing values, etc. Visualize distributions and correlations between variables.
Feature engineering – Derive new input features that may help predictive models. For NLP, features could be part-of-speech tags, character/word n-grams.
Model prototyping – Quickly try different algorithms like logistic regression, decision trees, random forests and SVMs on your task to evaluate which work best.
Deep learning model development – For image/text domains, develop convolutional or recurrent neural networks. Hyperparameter tuning is critical.
Model evaluation – Use performance metrics like accuracy, precision and recall to quantitatively compare models. Evaluate on holdout validation sets to detect overfitting and avoid overly optimistic estimates.
Model deployment – Save the best models and write code/notebooks to serve predictions as a web API or application.
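The prototyping and evaluation stages above can be sketched together: fit several classic algorithms on one task and compare them on a held-out validation set. The dataset (scikit-learn's bundled breast-cancer data) and the three models are illustrative choices, not the only options:

```python
# Prototype several classic classifiers and compare them on a holdout set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_val)
    print(f"{name}: acc={accuracy_score(y_val, preds):.3f} "
          f"prec={precision_score(y_val, preds):.3f} "
          f"rec={recall_score(y_val, preds):.3f}")
```

Keeping the comparison in one loop like this makes it easy to add or drop candidate models as prototyping proceeds.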
Effective documentation of your entire process is vital, through formatted READMEs, notebooks with explanations, diagrams and presentations. Clearly outline the problem, methodology, results and conclusions reached.
Once complete, share your work on platforms like GitHub to get peer feedback. Consider submitting a paper to a conference if results are novel. Capstone projects are a chance to gain real experience in all stages of an applied ML workflow on a self-directed task. Have fun exploring your chosen problem in depth!
