Just Put It On a Map
20 Mar 2026
progressandpoverty.substack.com

An underrated strategy for urbanist persuasion, powered by open source tools

A huge treasure trove of songs and interviews recorded by the legendary folklorist Alan Lomax from the 1940s into the 1990s has been digitized and made available online for free listening.

Electronic Texts of H.P. Lovecraft's Works
2 Mar 2026
hplovecraft.com

Join the discussion on this paper page

Land Report 100: Explore the 2025 Top 100 U.S. Landowners. Who is America's Largest Landowner? This question is the quest…

How to Design Production-Grade Mock Data Pipelines Using Polyfactory with Dataclasses, Pydantic, Attrs, and Nested Models

Humanity was already enjoying motion pictures a century ago. But the ability to do so at home still lay a few decades in the future, and the ability to pull up a movie on demand through a streaming service much further still.

Spotify's library was scraped in the name of music preservation, but will this make illegally training AI even easier?

Backing up Spotify
21 Dec 2025
annas-archive.li

We backed up Spotify (metadata and music files). It’s distributed in bulk torrents (~300TB). It’s the world’s first “preservation archive” for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens.

Streamline your financial data workflow with OpenBB. Learn setup, data extraction, and automation using Python—perfect for analysts and economists.

https://www.datacentermap.com/
15 Dec 2025
datacentermap.com
GlobalBuildingAtlas LoD1
12 Dec 2025
tubvsig-so2sat-vm1.srv.mwn.de
College Sports Finances Database
25 Nov 2025
sportico.com

Sportico is maintaining an interactive, real-time database that tracks the official balance sheets of public FBS university athletic departments.

User:Birdman86 - Wikimedia Commons
22 Oct 2025
commons.wikimedia.org
Browse Catalog | LibriVox
22 Oct 2025
librivox.org

LibriVox

65 Essential Children’s Books
15 Oct 2025
theatlantic.com

Illustrated titles that teach kids to love literature

Advanced Datasets for AI & ML Projects
31 Aug 2025
amanxai.com

In this article, I'll take you through a list of 20 advanced datasets you should try to build your next AI & ML projects.

Publicly available datasets in recommender research currently shaping the field.

International Postal & Zip Code Database
5 May 2025
geopostcodes.com

GeoPostcodes provides the world’s most comprehensive postal/zip code database. Complete, accurate, always up-to-date and enterprise-ready.

This Statology Sprint brings together our most valuable content on Faker, Python's powerful synthetic data generation library, to help you create realistic, privacy-compliant test data for your projects.

I Analyzed Chord Progressions in 680k Songs
18 Apr 2025
cantgetmuchhigher.com

And the results surprised me

OpenTimes
18 Mar 2025
simonwillison.net

Spectacular new open geospatial project by Dan Snow (https://sno.ws/): "OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time …"

Visualizing all the books in the world
26 Feb 2025
flowingdata.com

To show a catalog of almost 100 million books in one view, phiresky mapped them based on International Standard Book Numbers, or ISBNs, with an interactive visualization.

US Counties Database | Simplemaps.com
29 Dec 2024
simplemaps.com

Free and commercial U.S. counties databases. Includes latitude, longitude, population, largest city, zip codes, timezone, income and more. CSV, SQL and Excel format.

Publicly available data helps monitor ship traffic to avoid disruption of undersea internet cables, identify whale strikes, and study the footprint of underwater noise.

AI training data has a big price tag, one best-suited for deep-pocketed tech firms. This is why Harvard University plans to release a dataset that

131M American Buildings
6 Nov 2024
tech.marksblogg.com

Benchmarks & Tips for Big Data, Hadoop, AWS, Google Cloud, PostgreSQL, Spark, Python & More...

There may be a few young people in Britain today who recognize the name Ludwig Koch, but in the nineteen-forties, he constituted something of a cultural phenomenon unto himself.

List of U.S. stadiums by capacity
2 Aug 2024
en.m.wikipedia.org

The following is a list of stadiums in the United States. They are ranked by capacity, which is the maximum number of spectators the stadium can normally accommodate. All U.S. stadiums with a current capacity of 10,000 or more are included in the list. The majority of these stadiums are used for American football, either in college football or the NFL. Most of the others are Major League Baseball ballparks or Major League Soccer stadiums. Rows shaded in yellow indicate that the stadium is home to an NFL, MLB, MLS, or NWSL franchise.

Learn to identify some of the more common types of unexploded ordnance (UXO) in online images using open-source tools and resources.

The Pile dataset has become a hot topic in AI circles, sparking debates about how data is used and the

The full guide to creating custom datasets and dataloaders for different models in PyTorch

DMA® Regions | Nielsen
22 Jun 2024
nielsen.com

DMA (Designated Market Area) regions are the geographic areas and zip codes in the U.S. in which local television viewing is measured by Nielsen.

Carabiner Collection
20 Jun 2024
carabinercollection.com
Plastic Properties Table
12 Jun 2024
curbellplastics.com

Use our plastics properties table to sort and compare plastic materials. Review typical, physical, thermal, optical, electrical properties. Ask an Expert or Get a Quote.

Solar is in the process of shearing off the base of the entire global industrial stack – energy – and the tech sector still lacks a unified thesis for how to best enable, accelerate, an…

We live in an era of genre. Browse through TV shows of the last decade to see what I mean: Horror, sci-fi, fantasy, superheroes, futuristic dystopias…. Take a casual glance at the burgeoning global film franchises or merchandising empires.

Dataset distillation is an innovative approach that addresses the challenges posed by the ever-growing size of datasets in machine learning. This technique focuses on creating a compact, synthetic dataset that encapsulates the essential information of a larger dataset, enabling efficient and effective model training. Despite its promise, the intricacies of how distilled data retains its utility and information content have yet to be fully understood. Let’s delve into the fundamental aspects of dataset distillation, exploring its mechanisms, advantages, and limitations. Dataset distillation aims to overcome the limitations of large datasets by generating a smaller, information-dense dataset. Traditional data compression methods

Analytics, management, and business intelligence (BI) procedures, such as data cleansing, transformation, and decision-making, rely on data profiling. Content and quality reviews are becoming more important as data sets grow in size and variety of sources. In addition, organizations that rely on data must prioritize data quality review. Analysts and developers can enhance business operations by analyzing the dataset and drawing significant insights from it. Data profiling is a crucial tool for evaluating data quality. It entails analyzing, cleansing, transforming, and modeling data to find valuable information, improve data quality, and assist in better decision-making. What is Data Profiling? Examining

Computer vision has advanced significantly in recent decades, thanks in large part to comprehensive benchmark datasets like COCO. However, nearly a decade after its introduction, COCO's suitability as a benchmark for modern AI models is being questioned. Its annotations may contain biases and nuances reflecting the early stages of computer vision research. With model performance plateauing on COCO, there are concerns about overfitting to the dataset's specific characteristics, potentially limiting real-world applicability. To modernize COCO segmentation, researchers have proposed COCONut - a novel, large-scale universal segmentation dataset in this paper. Unlike previous attempts at creating large datasets that often compromised

Home - OpenCorporates
13 Apr 2024
opencorporates.com

A Fun Tutorial using Python, JSON, and Spotify API! You might find it more comfortable...

Stats about all US cities - real estate, relocation info, crime, house prices, schools, races, income, photos, sex offenders, maps, education, weather, home value estimator, recent sales, etc.

A proactive, coordinated effort can reduce the chances that manipulations will impact model performance and protect algorithmic integrity.

Beautiful, free images and photos that you can download and use for any project. Better than any royalty free or stock photos.

Welcome to Open Library | Open Library
10 Feb 2024
openlibrary.org

Open Library is an open, editable library catalog, building towards a web page for every book ever published. Read, borrow, and discover more than 3M books for free.

Why is the Pile a good benchmark?
4 Feb 2024
pile.eleuther.ai

The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

Sports Data Stuff | CFB PBP
23 Jan 2024
sportsdatastuff.com

College Football Play-By-Play Data

Developing large-scale datasets has been critical in computer vision and natural language processing. These datasets, rich in visual and textual information, are fundamental to developing algorithms capable of understanding and interpreting images. They serve as the backbone for enhancing machine learning models, particularly those tasked with deciphering the complex interplay between visual elements in images and their corresponding textual descriptions. A significant challenge in this field is the need for large-scale, accurately annotated datasets. These are essential for training models but are often not publicly accessible, limiting the scope of research and development. The ImageNet and OpenImages datasets, containing human-annotated

Nitasha Tiku / Washington Post: Analysis of 1,800 AI datasets: ~70% didn't state what license should be used or had been mislabeled with more permissive guidelines than their creators intended

Human motion capture has emerged as a key tool in various industries, including sports, medicine, and character animation for the entertainment sector. Motion capture is utilized in sports for multiple purposes, including injury prevention, injury analysis, video game industry animations, and even generating informative visualizations for TV broadcasters. Traditional motion capture systems provide solid results in the majority of circumstances. Still, they are expensive and time-consuming to set up, calibrate, and post-process, making them difficult to utilize on a broad scale. These concerns are made worse for aquatic activities like swimming, which bring up unique problems such as marker reflections

Europe at the end of the nineteenth century and beginning of the twentieth: what a time and place to be alive.

Library of Short Stories
25 Sep 2023
libraryofshortstories.com

Read For Free, Anywhere, Anytime. An online library of over 1000 classic short stories. H. G. Wells, Edgar Allan Poe, H. P. Lovecraft, Anton Chekhov, Beatrix Potter.

BookCorpus has helped train at least thirty influential language models (including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort), according to HuggingFace. This is the research question that…

Large-scale pre-trained vision and language models have demonstrated remarkable performance in numerous applications, allowing for the replacement of a fixed set of supported classes with zero-shot open vocabulary reasoning over (nearly arbitrary) natural language queries. However, recent research has revealed a fundamental flaw in these models: for instance, their inability to comprehend Visual Language Concepts (VLC) that extend 'beyond nouns,' such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or their difficulty with compositional reasoning, such as comprehending the significance of the word order in a sentence. Vision and language models, powerful machine-learning algorithms that learn

the-markup/xandr-audience-segments
30 Jul 2023
github.com

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

How to Anonymise Places in Python
18 Dec 2022
towardsdatascience.com

Ready-to-run code that identifies and anonymises places, based on the GeoNames database

11 Less Used but Important Plots for Data Science
23 Nov 2022
towardsdatascience.com

Some Unique Data Visualization Techniques for Getting High-Level Insight into the Data

Posted by Mahima Pushkarna, Senior Interaction Designer, and Andrew Zaldivar, Senior Developer Relations Engineer, Google Research As machine learn...

NCAA Statistics
20 Nov 2022
stats.ncaa.org
Create geo image dataset in 20 minutes
14 Oct 2022
towardsdatascience.com

Build geo specific subset of LAION-5B

The Allen Institute’s release includes recordings from a whopping 300,000 mouse neurons. Now the challenge is figuring out what to do with all that data.

National Rail Network Map
14 Sep 2022
arcgis.com
Home page | College Athletics Database
10 Sep 2022
knightnewhousedata.org

The “unreasonable effectiveness” of data for machine-learning applications has been widely debated over the years (see here, here and…

We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10...

Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research, Perception Team Over four billion people live in...

When AllMusic launched 25 years ago, it wasn't an obvious big data play. But it became one. Hidden in its millions of entries is music's collective history.

To understand what's happening, but also what's coming if synthetic data does get more broadly adopted, we talked to various CEOs and VCs over the last few months.

93 Datasets That Load With A Single Line of Code
13 May 2022
towardsdatascience.com

How you can pull one of a few dozen example political, sporting, education, and other frames on-the-fly.

Zip Code Database Lookup | Everything By Zip Code
3 May 2022
everythingbyzipcode.com

Our zip code database is a unified view of public datasets like the Census, American Community Survey, Bureau of Labor Statistics and the CDC, spanning 800+ data points, also offering a free zip code database.

Indus
11 Feb 2022
user.tu-berlin.de
How to Create Fake Data with Faker
17 Jan 2022
towardsdatascience.com

You can either Collect Data or Create your Own Data

Curating a Dataset from Raw Images and Videos
16 Jan 2022
link.medium.com

Best-practices to follow when building datasets from large pools of image and video data and tools that make it straightforward.

MedMNIST v2 Dataset | Papers With Code
29 Oct 2021
paperswithcode.com

MedMNIST v2 is a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28 x 28 (2D) or 28 x 28 x 28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, could support numerous research / educational purposes in biomedical image analysis, computer vision and machine learning. Description and image from: MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification. Each subset keeps the same license as that of the source dataset. Please also cite the corresponding paper of source data if you use any subset of MedMNIST.

The release of the FC-AMF-OCR Dataset by LightOn marks a significant milestone in optical character recognition (OCR) and machine learning. This dataset is a technical achievement and a cornerstone for future research in artificial intelligence (AI) and computer vision. Introducing such a dataset opens up new possibilities for researchers and developers, allowing them to improve OCR models, which are essential in converting images of text into machine-readable text formats. Background of LightOn and FC-AMF-OCR Dataset: LightOn, a company recognized for its pioneering contributions to AI and machine learning, has continuously pushed the boundaries of technology. The FC-AMF-OCR Dataset is one

DatasetGAN
24 May 2021
nv-tlabs.github.io
Awesome list of datasets in 100 categories
20 May 2021
kdnuggets.com

With an estimated 44 zettabytes of data in existence in our digital world today and approximately 2.5 quintillion bytes of new data generated daily, there is a lot of data out there you could tap into for your data science projects. It's pretty hard to curate through such a massive…

Web Scraping to Create a Dataset using Python
18 May 2021
thecleverprogrammer.com

In this article, I'm going to walk you through a tutorial on web scraping to create a dataset using Python and BeautifulSoup.

Create, maintain, and contribute to a long-living dataset that will update itself automatically across projects.

UCI Machine Learning Repository
3 Apr 2021
archive.ics.uci.edu

Discover datasets around the world!

www-eio.upc.edu/~pau/cms/rdata/datasets.html
23 Feb 2021
www-eio.upc.edu
UMLS Metathesaurus Browser
14 Jan 2021
uts.nlm.nih.gov

This is an interface for searching and browsing the UMLS Metathesaurus data. Our goal here is to present the UMLS Metathesaurus data in a useful way.

Emotional development is one of the largest and most productive areas of psychological research. For decades, researchers have been fascinated by how humans respond to, detect, and interpret emotional facial expressions. Much of the research in this area ...

Explore Census Data
30 Nov 2020
data.census.gov
judicial search
29 Nov 2020
judyrecords.com

Instantly search 740 million+ United States court cases.

Why Data Standards Matter
3 Nov 2020
safegraph.com

The power of join keys and how data standards can make data more valuable and accelerate collaboration and innovation. This is the second installment of the DaaS Bible series.

Special thanks to Plotly investor, NVIDIA, for their help in reviewing these open-source Dash applications for autonomous vehicle R&D, and Lyft for initial data visualization development in Plotly. Author: Xing Han Lu, @xhlulu (originally posted on Medium). To learn more about how to use Dash for Autonomous Vehicle and AI Applications, register for our live webinar with […]

Learn about and download U.S. Board on Geographic Names data from the Geographic Names Information System (GNIS)

Automated Data Import with Python
1 Jun 2020
towardsdatascience.com

A different approach to importing data files automatically in Python.

Generating Synthetic Patient Data
1 Jun 2020
towardsdatascience.com

A quick look at using Synthea

Time Series Analysis: Creating Synthetic Datasets
17 May 2020
towardsdatascience.com

How to create time series datasets with different patterns

Millions of tiny databases
9 Mar 2020
blog.acolyer.org

Check out this database of nearly 300 freely-accessible NLP datasets, curated from around the internet.

Slashdot
26 Feb 2020
tech.slashdot.org

Connect APIs, remarkably fast. Free for developers. - PipedreamHQ/pipedream

5 Data Cleansing Tools - DataScienceCentral.com
19 Feb 2020
datasciencecentral.com

You need to analyze data to make more informed decisions. There are many tools to help you analyze the data visually or statistically, but they only work if the data is already clean and consistent. Here is a list of 5 data cleansing tools. Drake: Drake is a simple-to-use, extensible, text-based data workflow tool that…

"Enter into picture Swarmplots, just like their name." https://lttr.ai/MJtZ #datavisualization #awesomevisualization #seaborn #python

This post explains the various techniques you can use to handle imbalanced datasets.

Data USA
19 Feb 2020
datausa.io

The most comprehensive visualization of U.S. public data. Data USA provides an open, easy-to-use platform that turns data into knowledge.

First trips to Paris all run the same risk: that of the museums consuming all of one's time in the city. What those new to Paris need is a museum-going strategy, not that one size will fit all.

Atlas: A Dataset and Benchmark for E-commerce Clothing Product Categorization - vumaasha/Atlas

When computer vision detectors are turned loose in the real world, their performance noticeably drops. In an effort to close this performance gap, a team of MIT and IBM researchers set out to create a very different kind of object-recognition dataset called ObjectNet.

Welcome! | Million Song Dataset
30 Oct 2019
millionsongdataset.com

Homepage for the National Football League's Big Data Bowl - nfl-football-ops/Big-Data-Bowl

Baidu this Thursday announced the release of ApolloScape, billed as the world’s largest open-source dataset for autonomous driving…

Game on Paper
24 Sep 2009
gameonpaper.com