July 18, 2019

Free dataset archive helps researchers quickly find a needle in a haystack

Let's say you're doing research that requires millions of geotagged tweets. Or perhaps you're a journalist who wants to map murders in Chicago from 2001 to the present. You need to find large spatio-temporal datasets—but where?

While hundreds of datasets are publicly available, locating them can take months of searching. Even when potential sources turn up, they rarely provide enough information for a researcher to decide whether a set contains the kind of data they need without first downloading the often huge file and sorting through it.

Thanks to a computer scientist at the University of California, Riverside, finding the right dataset is now as easy as bookmarking a website, and it costs absolutely nothing.

Ahmed Eldawy, an assistant professor of computer science in the Marlan and Rosemary Bourns College of Engineering, and his group spent the last three years combing the internet for public spatio-temporal datasets, studying their attributes, and summarizing the results for each set on interactive maps that show the user exactly what they're getting.

"People who work on data science need datasets but can spend a lot of time finding them," Eldawy said. "I wanted to build an archive they can find easily."

Called the UCR Spatio-temporal Active Repository, or UCR STAR, the archive is offered as a service to the research community, giving easy access to large spatio-temporal datasets through an interactive exploratory interface. Users can search and filter the datasets as if shopping for their research, except that everything is free.

"The map interface visualizes the data, so you can see if it's a good fit," Eldawy said. "It's like a catalog for datasets."

At the heart of UCR STAR, the map provides an interactive exploratory interface for the dataset. Similar to Google Maps or other web maps, users can zoom in and out and pan around to get a quick overview of the data distribution, coverage, and accuracy.

Important details are displayed once a dataset is selected, such as the original homepage, a link to the original download source, size in bytes, number of records, file format, and other useful information. The subset download feature allows users to quickly download the data in a given geographical region, which reduces the download size. They can also embed their customized view on a webpage or share the link via social media and bookmark it to revisit later.
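As a rough illustration of the idea behind downloading only the data in a given region, the Python sketch below filters point records against a bounding box. It is not UCR STAR's code or API: the names `Record`, `BoundingBox`, and `subset`, the sample points, and the approximate Chicago coordinates are all assumptions made for this example.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """A hypothetical geotagged record: longitude, latitude, and a label."""
    lon: float
    lat: float
    label: str


class BoundingBox(NamedTuple):
    """A rectangular region in degrees (assumed not to cross the antimeridian)."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        # A point is inside when both coordinates fall within the box edges.
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def subset(records: Iterable[Record], box: BoundingBox) -> List[Record]:
    """Keep only the records that fall inside the requested region."""
    return [r for r in records if box.contains(r.lon, r.lat)]


# Illustrative sample data; coordinates are approximate.
SAMPLE = [
    Record(-87.63, 41.88, "chicago"),
    Record(-117.40, 33.95, "riverside"),
    Record(-74.00, 40.71, "new york"),
]

# Rough box around Chicago, chosen only for the example.
CHICAGO = BoundingBox(west=-87.94, south=41.64, east=-87.52, north=42.02)

if __name__ == "__main__":
    for record in subset(SAMPLE, CHICAGO):
        print(record.label)
```

Only the records inside the box are kept, which is why a regional subset can be much smaller than the full file.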

UCR STAR contains 102 datasets and 5 billion records. The datasets were mapped using Da Vinci, an open-source framework built on top of Apache Spark that Eldawy designed to work with spatial data. The UCR STAR website is best accessed through a desktop browser but also has a limited mobile-friendly interface.

More information: UCR STAR: star.cs.ucr.edu/

Provided by University of California - Riverside

Citation: Free dataset archive helps researchers quickly find a needle in a haystack (2019, July 18) retrieved 17 July 2024 from https://techxplore.com/news/2019-07-free-dataset-archive-quickly-needle.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
