datasets

GeoPostcodes provides the world’s most comprehensive postal/zip code database. Complete, accurate, always up-to-date and enterprise-ready.

And the results surprised me

This Statology Sprint brings together our most valuable content on Faker, Python's powerful synthetic data generation library, to help you create realistic, privacy-compliant test data for your projects.
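
The library's appeal is easy to sketch: seeded random generators producing realistic-looking records. As a minimal, dependency-free illustration of that idea (the field names and value pools below are invented for this example, not Faker's own providers; with Faker itself this collapses to calls like `fake.name()` and `fake.email()`):

```python
import random

# Pools of sample values (invented for this sketch; Faker ships real providers).
FIRST_NAMES = ["Alice", "Bob", "Carol", "David"]
LAST_NAMES = ["Nguyen", "Smith", "Garcia", "Okafor"]
DOMAINS = ["example.com", "example.org"]

def fake_person(rng: random.Random) -> dict:
    """Generate one synthetic, privacy-safe test record."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@{rng.choice(DOMAINS)}",
        "age": rng.randint(18, 90),
    }

# Seeding makes the synthetic data reproducible across runs.
rng = random.Random(42)
people = [fake_person(rng) for _ in range(3)]
```

Seeding is the important part for test data: a fixed seed means the same "fake" records on every test run.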

Spectacular new open geospatial project by [Dan Snow](https://sno.ws/):

> OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time …

To show a catalog of almost 100 million books in one view, phiresky mapped them based on International Standard Book Numbers, or ISBNs, with an interactive visualization.

Free and commercial U.S. counties databases. Includes latitude, longitude, population, largest city, zip codes, timezone, income and more. CSV, SQL and Excel format.

Publicly available data helps monitor ship traffic to avoid disruption of undersea internet cables, identify whale strikes, and study the footprint of underwater noise.

AI training data has a big price tag, one best-suited for deep-pocketed tech firms. This is why Harvard University plans to release a dataset that

Benchmarks & Tips for Big Data, Hadoop, AWS, Google Cloud, PostgreSQL, Spark, Python & More...

There may be a few young people in Britain today who recognize the name Ludwig Koch, but in the nineteen-forties, he constituted something of a cultural phenomenon unto himself.

The release of the FC-AMF-OCR Dataset by LightOn marks a significant milestone in optical character recognition (OCR) and machine learning. This dataset is both a technical achievement and a cornerstone for future research in artificial intelligence (AI) and computer vision. Introducing such a dataset opens up new possibilities for researchers and developers, allowing them to improve OCR models, which are essential for converting images of text into machine-readable formats. Background of LightOn and the FC-AMF-OCR Dataset: LightOn, a company recognized for its pioneering contributions to AI and machine learning, has continuously pushed the boundaries of technology. The FC-AMF-OCR Dataset is one

The following is a list of stadiums in the United States. They are ranked by capacity, which is the maximum number of spectators the stadium can normally accommodate. All U.S. stadiums with a current capacity of 10,000 or more are included in the list. The majority of these stadiums are used for American football, either in college football or the NFL. Most of the others are Major League Baseball ballparks or Major League Soccer stadiums. Rows shaded in yellow indicate that the stadium is home to an NFL, MLB, MLS, or NWSL franchise.

Learn to identify some of the more common types of unexploded ordnance (UXO) in online images using open-source tools and resources.

The pile dataset has become a hot topic in AI circles, sparking debates about how data is used and the

The full guide to creating custom datasets and dataloaders for different models in PyTorch
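
As a rough sketch of the core idea, with no PyTorch dependency here: a map-style dataset is simply an object exposing `__len__` and `__getitem__`; in real code you would subclass `torch.utils.data.Dataset` and hand the result to a `DataLoader` for batching and shuffling. The class and toy rows below are invented for illustration:

```python
class CsvRowDataset:
    """Minimal map-style dataset. In PyTorch you would subclass
    torch.utils.data.Dataset, which relies on exactly this protocol:
    __len__ for the sample count, __getitem__ for indexed access."""

    def __init__(self, rows):
        # Each row is a (features, label) pair, e.g. parsed from a CSV file.
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        features, label = self.rows[idx]
        return features, label

ds = CsvRowDataset([([0.1, 0.2], 0), ([0.3, 0.4], 1)])
```

A `DataLoader` wrapped around such an object then takes care of shuffling, batching, and parallel loading.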

DMA (Designated Market Area) regions are the geographic areas and zip codes in the U.S. in which local television viewing is measured by Nielsen.

Use our plastics properties table to sort and compare plastic materials. Review typical, physical, thermal, optical, electrical properties. Ask an Expert or Get a Quote.

Solar is in the process of shearing off the base of the entire global industrial stack – energy – and the tech sector still lacks a unified thesis for how to best enable, accelerate, an…

We live in an era of genre. Browse through TV shows of the last decade to see what I mean: Horror, sci-fi, fantasy, superheroes, futuristic dystopias…. Take a casual glance at the burgeoning global film franchises or merchandising empires.

Dataset distillation is an innovative approach that addresses the challenges posed by the ever-growing size of datasets in machine learning. This technique focuses on creating a compact, synthetic dataset that encapsulates the essential information of a larger dataset, enabling efficient and effective model training. Despite its promise, the intricacies of how distilled data retains its utility and information content have yet to be fully understood. Let’s delve into the fundamental aspects of dataset distillation, exploring its mechanisms, advantages, and limitations. Dataset distillation aims to overcome the limitations of large datasets by generating a smaller, information-dense dataset. Traditional data compression methods

Analytics, management, and business intelligence (BI) procedures, such as data cleansing, transformation, and decision-making, rely on data profiling. Content and quality reviews are becoming more important as data sets grow in size and variety of sources. In addition, organizations that rely on data must prioritize data quality review. Analysts and developers can enhance business operations by analyzing the dataset and drawing significant insights from it. Data profiling is a crucial tool for evaluating data quality: it entails analyzing, cleansing, transforming, and modeling data to find valuable information, improve data quality, and assist in better decision-making. What is Data Profiling? Examining

Computer vision has advanced significantly in recent decades, thanks in large part to comprehensive benchmark datasets like COCO. However, nearly a decade after its introduction, COCO's suitability as a benchmark for modern AI models is being questioned. Its annotations may contain biases and nuances reflecting the early stages of computer vision research. With model performance plateauing on COCO, there are concerns about overfitting to the dataset's specific characteristics, potentially limiting real-world applicability. To modernize COCO segmentation, researchers have proposed COCONut - a novel, large-scale universal segmentation dataset in this paper. Unlike previous attempts at creating large datasets that often compromised

A Fun Tutorial using Python, JSON, and Spotify API! You might find it more comfortable...

Stats about all US cities - real estate, relocation info, crime, house prices, schools, races, income, photos, sex offenders, maps, education, weather, home value estimator, recent sales, etc.

A proactive, coordinated effort can reduce the chances that manipulations will impact model performance and protect algorithmic integrity.

Beautiful, free images and photos that you can download and use for any project. Better than any royalty free or stock photos.

Open Library is an open, editable library catalog, building towards a web page for every book ever published. Read, borrow, and discover more than 3M books for free.

The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

College Football Play-By-Play Data

Developing large-scale datasets has been critical in computer vision and natural language processing. These datasets, rich in visual and textual information, are fundamental to developing algorithms capable of understanding and interpreting images. They serve as the backbone for enhancing machine learning models, particularly those tasked with deciphering the complex interplay between visual elements in images and their corresponding textual descriptions. A significant challenge in this field is the need for large-scale, accurately annotated datasets. These are essential for training models but are often not publicly accessible, limiting the scope of research and development. The ImageNet and OpenImages datasets, containing human-annotated

Nitasha Tiku / Washington Post: Analysis of 1,800 AI datasets: ~70% didn't state what license should be used or had been mislabeled with more permissive guidelines than their creators intended

Human motion capture has emerged as a key tool in various industries, including sports, medical, and character animation for the entertainment sector. Motion capture is utilized in sports for multiple purposes, including injury prevention, injury analysis, video game industry animations, and even generating informative visualization for TV broadcasters. Traditional motion capture systems provide solid results in the majority of circumstances. Still, they are expensive and time-consuming to set up, calibrate, and post-process, making them difficult to utilize on a broad scale. These concerns are made worse for aquatic activities like swimming, which bring up unique problems such as marker reflections

Europe at the end of the nineteenth century and beginning of the twentieth: what a time and place to be alive.

Read For Free, Anywhere, Anytime. An online library of over 1000 classic short stories. H. G. Wells, Edgar Allan Poe, H. P. Lovecraft, Anton Chekhov, Beatrix Potter.

BookCorpus has helped train at least thirty influential language models (including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort), according to HuggingFace. This is the research question that…

Large-scale pre-trained vision and language models have demonstrated remarkable performance in numerous applications, allowing a fixed set of supported classes to be replaced with zero-shot open-vocabulary reasoning over (nearly arbitrary) natural language queries. However, recent research has revealed a fundamental flaw in these models: for instance, their inability to comprehend Visual Language Concepts (VLC) that extend 'beyond nouns,' such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or their difficulty with compositional reasoning, such as comprehending the significance of word order in a sentence. Vision and language models, powerful machine-learning algorithms that learn

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

A ready-to-run code which identifies and anonymises places, based on the GeoNames database
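
The core idea can be sketched without the GeoNames download: match known place names from a gazetteer and replace each occurrence with a placeholder. The three-entry gazetteer and the `[PLACE]` token below are stand-ins invented for this example; the linked code draws its place names from the actual GeoNames database:

```python
import re

# A toy stand-in gazetteer; the real approach loads place names from GeoNames.
GAZETTEER = {"Paris", "Berlin", "Tokyo"}

def anonymise_places(text: str, token: str = "[PLACE]") -> str:
    """Replace any whole-word gazetteer match with a placeholder token."""
    pattern = r"\b(" + "|".join(re.escape(p) for p in sorted(GAZETTEER)) + r")\b"
    return re.sub(pattern, token, text)

print(anonymise_places("She flew from Paris to Tokyo."))
# With this toy gazetteer: "She flew from [PLACE] to [PLACE]."
```

A real gazetteer also has to deal with multi-word names and ambiguity (e.g. places that are also common words), which is where the GeoNames metadata earns its keep.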

Some Unique Data Visualization Techniques for Getting High-Level Insight into the Data

Posted by Mahima Pushkarna, Senior Interaction Designer, and Andrew Zaldivar, Senior Developer Relations Engineer, Google Research As machine learn...

Build geo specific subset of LAION-5B

The Allen Institute’s release includes recordings from a whopping 300,000 mouse neurons. Now the challenge is figuring out what to do with all that data.

The “unreasonable effectiveness” of data for machine-learning applications has been widely debated over the years (see here, here and…

We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10...

Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research, Perception Team Over four billion people live in...

When AllMusic launched 25 years ago, it wasn't an obvious big data play. But it became one. Hidden in its millions of entries is music's collective history.

To understand what's happening, but also what's coming if synthetic data does get more broadly adopted, we talked to various CEOs and VCs over the last few months.

How you can pull one of a few dozen example political, sporting, education, and other frames on-the-fly.

Our zip code database is a unified view of public datasets like the Census, American Community Survey, Bureau of Labor Statistics and the CDC, spanning 800+ data points, also offering a free zip code database.

You can either Collect Data or Create your Own Data

Best-practices to follow when building datasets from large pools of image and video data and tools that make it straightforward.

MedMNIST v2 is a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28 x 28 (2D) or 28 x 28 x 28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of 708,069 2D images and 10,214 3D images in total, could support numerous research / educational purposes in biomedical image analysis, computer vision and machine learning. Description and image from: MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification Each subset keeps the same license as that of the source dataset. Please also cite the corresponding paper of source data if you use any subset of MedMNIST.

With an estimated 44 zettabytes of data in existence in our digital world today and approximately 2.5 quintillion bytes of new data generated daily, there is a lot of data out there you could tap into for your data science projects. It's pretty hard to curate through such a massive…

In this article, I'm going to walk you through a tutorial on web scraping to create a dataset using Python and BeautifulSoup.
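
As a dependency-free sketch of the extraction step the tutorial performs with BeautifulSoup, the standard library's `html.parser` can pull table cells out of an HTML snippet. The HTML sample below is invented; with BeautifulSoup the equivalent is roughly `soup.find_all('td')`:

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell: the raw material for one dataset row."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Only keep non-empty text that appears inside a <td> element.
        if self.in_td and data.strip():
            self.cells.append(data.strip())

html = "<table><tr><td>Alice</td><td>42</td></tr></table>"
parser = CellCollector()
parser.feed(html)
# parser.cells == ["Alice", "42"]
```

In practice you would pair this extraction with `requests` (or `urllib.request`) to fetch the page, then write the collected rows out to CSV.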

Create, maintain, and contribute to a long-living dataset that will update itself automatically across projects.

Discover datasets around the world!

This is an interface for searching and browsing the UMLS Metathesaurus data. Our goal here is to present the UMLS Metathesaurus data in a useful way.

Emotional development is one of the largest and most productive areas of psychological research. For decades, researchers have been fascinated by how humans respond to, detect, and interpret emotional facial expressions. Much of the research in this area ...

The power of join keys and how data standards can make data more valuable and accelerate collaboration and innovation. This is the second installment of the DaaS Bible series.

Special thanks to Plotly investor, NVIDIA, for their help in reviewing these open-source Dash applications for autonomous vehicle R&D, and Lyft for initial data visualization development in Plotly. Author: Xing Han Lu, @xhlulu (originally posted on Medium). To learn more about how to use Dash for Autonomous Vehicle and AI Applications register for our live webinar with […]

Learn about and download U.S. Board on Geographic Names data from the Geographic Names Information System (GNIS)

A different approach to importing data files automatically in Python.

A quick look at using Synthea

How to create time series datasets with different patterns
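
One common recipe, sketched here with only the standard library, is to compose a linear trend, a seasonal component, and random noise; the amplitudes, period, and seed below are arbitrary choices for the example:

```python
import math
import random

def make_series(n: int, trend: float = 0.1, period: int = 12,
                amplitude: float = 2.0, noise: float = 0.5, seed: int = 0):
    """Synthesize n points as trend + seasonality + Gaussian noise."""
    rng = random.Random(seed)
    return [
        trend * t                                      # linear upward drift
        + amplitude * math.sin(2 * math.pi * t / period)  # repeating seasonal cycle
        + rng.gauss(0, noise)                          # random measurement noise
        for t in range(n)
    ]

series = make_series(48)
```

Varying which components you include (drop the trend, change the period, make the noise heteroscedastic) gives the different patterns useful for stress-testing forecasting models.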

Check out this database of nearly 300 freely-accessible NLP datasets, curated from around the internet.

Connect APIs, remarkably fast. Free for developers. - PipedreamHQ/pipedream

You need to analyze data to make more informed decisions. There are many tools to help you analyze data visually or statistically, but they only work if the data is already clean and consistent. Here is a list of five data cleansing tools. Drake is a simple-to-use, extensible, text-based data workflow tool that…

"Enter into picture Swarmplots, just like their name." https://lttr.ai/MJtZ #datavisualization #awesomevisualization #seaborn #python

This post is about explaining the various techniques you can use to handle imbalanced datasets
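
One of the simplest such techniques, random oversampling of the minority class, can be sketched with the standard library alone (the toy samples and labels are invented; libraries such as imbalanced-learn provide more principled variants like SMOTE):

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_s, out_l = list(samples), list(labels)
    for cls, count in counts.items():
        # Candidates to duplicate for this class.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_s.append(rng.choice(pool))
            out_l.append(cls)
    return out_s, out_l

# Three samples of class 0 versus one of class 1 -> balanced to 3 and 3.
X, y = random_oversample(["a", "b", "c", "d"], [0, 0, 0, 1])
```

Oversampling duplicates information rather than adding it, which is why class weights or synthetic-sample methods are often preferred on very skewed data.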

The most comprehensive visualization of U.S. public data. Data USA provides an open, easy-to-use platform that turns data into knowledge.

First trips to Paris all run the same risk: that of the museums consuming all of one's time in the city. What those new to Paris need is a museum-going strategy, not that one size will fit all.

Atlas: A Dataset and Benchmark for E-commerce Clothing Product Categorization - vumaasha/Atlas

When computer vision detectors are turned loose in the real world, their performance noticeably drops. In an effort to close this performance gap, a team of MIT and IBM researchers set out to create a very different kind of object-recognition dataset called ObjectNet.

Homepage for the National Football League's Big Data Bowl - nfl-football-ops/Big-Data-Bowl

Baidu this Thursday announced the release of ApolloScape, billed as the world’s largest open-source dataset for autonomous driving…