Public data sets for testing and prototyping

Browse this list of public data sets for data that you can use to prototype and test storage and analytics services and solutions.

U.S. Government and agency data

Data source | About the data | About the files
US Government data | Over 250,000 data sets covering agriculture, climate, consumer, ecosystems, education, energy, finance, health, local government, manufacturing, maritime, ocean, public safety, and science and research in the U.S. | Files of various sizes in various formats including HTML, XML, CSV, JSON, Excel, and many others. You can filter available data sets by file format.
US Census data | Statistical data about the population of the U.S. | Data sets are in various formats.
Earth science data from NASA | Over 32,000 data collections covering agriculture, atmosphere, biosphere, climate, cryosphere, human dimensions, hydrosphere, land surface, oceans, sun-earth interactions, and more. | Data sets are in various formats.
Airline flight delays and other transportation data | "The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights appears ... in summary tables posted on this website." | Files are in CSV format.
Traffic fatalities - US Fatality Analysis Reporting System (FARS) | "FARS is a nationwide census providing NHTSA, Congress, and the American public yearly data regarding fatal injuries suffered in motor vehicle traffic crashes." | "Create your own fatality data run online by using the FARS Query System. Or download all FARS data from 1975 to present from the FTP Site."
Toxic chemical data - EPA Toxicity ForeCaster (ToxCast™) data | "EPA's most updated, publicly available high-throughput toxicity data on thousands of chemicals. This data is generated through the EPA's ToxCast research effort." | Data sets are available in various formats including spreadsheets, R packages, and MySQL database files.
Toxic chemical data - NIH Tox21 Data Challenge 2014 | "The 2014 Tox21 data challenge is designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects." | Data sets are available in SMILES and SDF formats. The data provides "assay activity data and chemical structures on the Tox21 collection of ~10,000 compounds (Tox21 10K)."
Biotechnology and genome data from the NCBI | Multiple data sets covering genes, genomes, and proteins. | Data sets are in text, XML, BLAST, and other formats. A BLAST app is available.

Other statistical and scientific data

New York City taxi data | "Taxi trip records include fields capturing pick-up and dropoff dates/times, pick-up and dropoff locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts." | Data sets are in CSV files by month.
Microsoft Research data sets - "Data Science for Research" | Multiple data sets covering human-computer interaction, audio/video, data mining/information retrieval, geospatial/location, natural language processing, and robotics/computer vision. | Data sets are in various formats, zipped for download.
Open Science Data Cloud data | "The Open Science Data Cloud provides the scientific community with resources for storing, sharing, and analyzing terabyte and petabyte-scale scientific datasets." | Data sets are in various formats.
Global climate data - WorldClim | "WorldClim is a set of global climate layers (gridded climate data) with a spatial resolution of about 1 km². These data can be used for mapping and spatial modeling." | These files contain geospatial data. For more info, see Data format.
Data about human society - The GDELT Project | "The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created." | The raw data files are in CSV format.
Advertising click prediction data for machine learning from Criteo | "The largest ever publicly released ML dataset." | For more info, see Criteo's 1 TB Click Prediction Dataset.
ClueWeb09 text mining data set from The Lemur Project | "The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in 10 languages that were collected in January and February 2009." | See Dataset Information.
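
When prototyping against CSV files delivered by month, such as the New York City taxi trip records, it can help to stream rows rather than load a whole file into memory. The following Python is a minimal sketch of that idea using only the standard library. The column names `trip_distance` and `payment_type`, the payment labels, and the inline sample rows are assumptions made for illustration; check the header row of the file you actually download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for one monthly CSV file. Real files have
# many more columns, and their header names may differ from these.
SAMPLE_CSV = """trip_distance,payment_type,passenger_count
1.5,card,1
3.0,cash,2
2.5,card,1
"""

def total_distance_by_payment(lines,
                              distance_col="trip_distance",
                              payment_col="payment_type"):
    """Sum trip distance per payment type from an iterable of CSV lines."""
    totals = defaultdict(float)
    # DictReader reads one row at a time, so memory use stays flat.
    for row in csv.DictReader(lines):
        totals[row[payment_col]] += float(row[distance_col])
    return dict(totals)

totals = total_distance_by_payment(io.StringIO(SAMPLE_CSV))
print(totals)
```

To run the same function against a downloaded file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`, and adjust the two column arguments to match the real header.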

Online service data

GitHub Archive | "GitHub Archive is a project to record the public GitHub timeline [of events], archive it, and make it easily accessible for further analysis." | Download JSON-encoded event archives in .gz (Gzip) format from a web client.
GitHub activity data from The GHTorrent project | "The GHTorrent project [is] an effort to create a scalable, queryable, offline mirror of data offered through the GitHub REST API. GHTorrent monitors the GitHub public event time line. For each event, it retrieves its contents and their dependencies, exhaustively." | MySQL database dumps are in CSV format.
Stack Overflow data dump | "This is an anonymized dump of all user-contributed content on the Stack Exchange network [including Stack Overflow]." | "Each site [such as Stack Overflow] is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory, and PostLinks."
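
For Gzip-compressed, JSON-encoded event archives like those described for GitHub Archive, a first prototyping step is often to decompress a file and tally what it contains. The Python below is a minimal sketch under two assumptions that are not stated above: that each archive holds one JSON object per line, and that each event carries a `type` field. The sample events are made up and built in memory so the example runs without a download.

```python
import gzip
import io
import json
from collections import Counter

def count_event_types(gz_bytes):
    """Count events by their "type" field in gzip-compressed JSON lines."""
    counts = Counter()
    # gzip.open accepts a file object; "rt" mode decodes the stream as text.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts

# Hypothetical events standing in for a downloaded archive.
sample_events = [
    {"type": "PushEvent"},
    {"type": "WatchEvent"},
    {"type": "PushEvent"},
]
sample_gz = gzip.compress(
    "\n".join(json.dumps(e) for e in sample_events).encode("utf-8")
)

counts = count_event_types(sample_gz)
print(counts.most_common())
```

For a real archive saved to disk, reading the file's bytes and passing them to `count_event_types` would work for small files; for large ones, opening the path directly with `gzip.open(path, "rt")` avoids holding the compressed data in memory.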