Hey, how are things? I have often had people ask me where they could find good machine learning datasets for their practice projects. Therefore, I decided to compile a list of all the dataset sites and sources I know of.
Let’s get started.
List of Top Machine Learning Datasets for Practice
This list compiles some of the best datasets for machine learning projects, all of which I’ve used myself.
Here’s a quick summary of the list:
- Kaggle Datasets
- UCI-ML Repository
- Google Datasets
- Wikipedia Datasets
- AWS Dataset
- Google Speech Commands
- US presidential election tweets 2020
- 538 data
- Buzzfeed news dataset
- Open Government Data (OGD) Platform India
- World Bank datasets
Let’s go over all the datasets listed here one-by-one!
1. Kaggle Machine Learning Datasets
Kaggle Datasets is not just a plain repository of data. Each dataset is a community where you can discuss the data, explore public code and techniques, and create your own projects in Kaggle Notebooks.
If you take the time to dig around, you will find fascinating datasets in all shapes and sizes!
Some of the most common ML analyses have been carried out on these datasets, the Titanic dataset being a well-known example.
2. UCI-ML Repository
The UCI Machine Learning Repository is one of the oldest dataset sites on the web.
While the datasets are user-contributed and vary in documentation and cleanliness, the vast majority of them are clean and ready to be used for machine learning.
UCI is a great first stop when looking for interesting datasets.
You can download data directly from the UCI Machine Learning Repository without having to request permission.
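Many UCI files are plain comma-separated text with no header row, so you supply the column names yourself. Here is a minimal sketch of that, using only the standard library. The sample rows are inlined in the style of the Iris file so the example runs offline, and the column names and the `parse_uci_rows` helper are my own naming, not something the repository defines.

```python
import csv
import io

# A few rows in the style of UCI's Iris file, inlined so the sketch runs offline.
# In practice you would open the file you downloaded from the repository instead.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor

"""

# Column names are an assumption: the raw file has no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def parse_uci_rows(text, columns=COLUMNS):
    """Turn header-less CSV text into a list of dicts, skipping blank lines."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # csv.reader yields [] for empty lines
            continue
        record = dict(zip(columns, fields))
        # Every column except the final label is numeric in this sample.
        for name in columns[:-1]:
            record[name] = float(record[name])
        rows.append(record)
    return rows


rows = parse_uci_rows(SAMPLE)
print(len(rows), rows[0]["species"])
```

For a real file, you would pass the file's contents (or adapt the function to take an open file handle) and adjust `COLUMNS` to match that dataset's documentation.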
3. Google Machine Learning Datasets
Google makes datasets available online that cover crime data, medical data from hospitals, bitcoin and other cryptocurrencies, country-by-country cases, and much more.
Here’s another machine learning dataset by Google for your practice project.
4. Wikipedia Datasets
Wikipedia needs no introduction for anyone who has used the internet! It offers free copies of all available content to interested users.
These database dumps can be used for mirroring, personal use, informal backups, offline use, or database queries.
They are stored as 7z files of about 6 GB each: Download page
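Once a dump is decompressed you are left with a very large XML file, so it helps to read it page by page instead of loading everything into memory. Below is a rough sketch of that idea with the standard library. The tiny `SAMPLE` document is a simplified stand-in I made up; real dumps use namespaced tags and carry more fields per page, so the tag names here would need adapting.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for a decompressed dump. Real dumps are far larger
# and their tags carry an XML namespace, which this sample leaves out.
SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title><revision><text>First article.</text></revision></page>
  <page><title>Beta</title><revision><text>Second one.</text></revision></page>
</mediawiki>"""


def iter_pages(stream):
    """Yield (title, text) pairs one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text")
            yield title, text
            elem.clear()  # drop the page's children once they have been read


pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Because `iter_pages` is a generator, you can loop over it and stop early, which is handy when you only want a few thousand articles for a practice project.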
5. AWS Dataset
We recently covered Amazon Web Services (AWS) Lambda programming. AWS services are cloud server infrastructure provided on a pay-per-use basis. The AWS data registry exists to help people discover and share datasets that are available via AWS resources.
The AWS datasets for machine learning include genome and cell research data, covering areas such as COVID and cancer.
6. Google Speech Commands
Few free, open-source datasets are suitable for a beginner’s tutorial or well adapted to basic keyword detection.
To address this, the TensorFlow and AIY teams created the Speech Commands Dataset and used it to add training and inference sample code to TensorFlow.
The dataset includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people and contributed by members of the public through the AIY website.
It is designed to let you build simple but useful voice interfaces for applications, and it covers common words such as “Yes,” “No,” digits, and directions. The infrastructure used to produce the data was also open-sourced.
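Since the clips are meant to be one second long, a common first step is to force every clip to exactly the same number of samples before feeding it to a model. Here is a small sketch of that using only the standard library. I am assuming 16 kHz, mono, 16-bit PCM WAV files; `make_test_clip` and `load_one_second` are hypothetical helpers of mine, and the test clip is generated silence so the code runs without downloading anything.

```python
import io
import struct
import wave

SAMPLE_RATE = 16000  # assumption: 16 kHz mono 16-bit PCM clips


def make_test_clip(num_samples, sample_rate=SAMPLE_RATE):
    """Build an in-memory WAV of silence so the sketch runs without the dataset."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % num_samples, *([0] * num_samples)))
    buffer.seek(0)
    return buffer


def load_one_second(source, sample_rate=SAMPLE_RATE):
    """Read a mono 16-bit WAV and pad or trim it to exactly one second."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit audio")
        if wav.getframerate() != sample_rate:
            raise ValueError("unexpected sample rate")
        frames = wav.readframes(wav.getnframes())
    samples = list(struct.unpack("<%dh" % (len(frames) // 2), frames))
    samples = samples[:sample_rate]  # trim clips that run long
    samples += [0] * (sample_rate - len(samples))  # zero-pad clips that run short
    return samples


clip = load_one_second(make_test_clip(12000))
print(len(clip))
```

With the real dataset you would pass a file path to `load_one_second` instead of the in-memory buffer; `wave.open` accepts either.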
7. US presidential election tweets 2020
The repository contains an ongoing collection of tweet IDs associated with the 2020 United States presidential election, with data collection starting on May 20, 2019. The authors used Twitter’s streaming API to follow specified accounts and to collect, in real time, tweets that mention specific keywords. To comply with Twitter’s Terms of Service, they publicly released only the tweet IDs of the collected tweets. The data is released for non-commercial research use.
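Because only the IDs are published, you have to look the tweets up yourself through Twitter’s API before you can use them (often called "hydrating"). Lookup endpoints accept a limited number of IDs per request, so a practical first step is to deduplicate the IDs and split them into batches. The sketch below does just that and makes no network calls. The batch size of 100 is my assumption and should be checked against the current API limits, and the IDs are made up.

```python
def read_tweet_ids(lines):
    """Keep one copy of each numeric ID, preserving order; ignore blanks."""
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up IDs standing in for one of the repository's ID files.
sample_lines = ["1001\n", "1002\n", "\n", "1001\n", "1003\n"]
ids = read_tweet_ids(sample_lines)
id_batches = list(batches(ids, size=2))
print(id_batches)
```

Each batch can then be handed to whichever hydration tool or API client you prefer.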
8. 538 data
FiveThirtyEight (538) publishes the datasets behind the analyses it performs, including recent ones from 2020. They are freely available for non-commercial use.
9. BuzzFeed News dataset for ML
BuzzFeed, Inc. is an American Internet media, news and entertainment company with a focus on digital media; it is based in New York City. BuzzFeed was founded in 2006 by Jonah Peretti and John S. Johnson III, to focus on tracking viral content.
BuzzFeed News provides open-source data, analysis, libraries, tools, and guides from BuzzFeed’s newsroom.
10. Open Government Data (OGD) Platform India
Open Government Data (OGD) Platform India (data.gov.in) is a platform that supports the Government of India’s Open Data initiative. It is intended for Government of India ministries, departments, and their agencies to publish datasets, records, programmes, resources, and applications they have collected, for public use.
It aims to improve transparency in the workings of the Government and also to open doors for even more creative applications of Government Data to offer a different viewpoint.
The Open Government Data Portal India is a collaborative project of the Government of India and the Government of the United States. Open Government Data Portal India is now bundled as a product and made available to countries worldwide as an open-source platform for deployment.
11. World Bank datasets
The World Bank is a global development organisation that provides developing countries with loans and advice. It regularly finances projects in developing nations and collects data to track the progress of these programmes.
You can browse World Bank datasets directly, without logging in. However, the datasets contain many missing values, and it often takes many clicks to finally get to the data.
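Given how common gaps are, it is worth counting the missing cells per column before you start modelling. Here is a minimal sketch with the standard library. The sample table and its values are made up, and the set of markers treated as "missing" (`""`, `".."`, `"NA"`) is an assumption you should adjust to match your own export.

```python
import csv
import io

# Made-up values in a country-by-year layout, inlined so the sketch runs offline.
SAMPLE = """Country,2018,2019,2020
A,1.5,,2.1
B,..,3.0,
C,4.2,4.4,4.9
"""

MISSING_MARKERS = {"", "..", "NA"}  # assumption: how gaps appear in your export


def missing_by_column(text, id_column="Country"):
    """Count missing cells for every column except the identifier column."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames if name != id_column}
    for row in reader:
        for name in counts:
            # Short rows give None for absent cells, so treat that as empty.
            value = (row[name] or "").strip()
            if value in MISSING_MARKERS:
                counts[name] += 1
    return counts


counts = missing_by_column(SAMPLE)
print(counts)
```

Columns with a high count are candidates for dropping or imputing, depending on what your practice project needs.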
Final Words (And Bonus Machine Learning Datasets)
You’ll find plenty of other datasets out there too, such as a music notes repository or a conversation analysis dataset. Go through all of these, and you’ll probably have a lot of practice projects to work on. Until next time.