Large public datasets?

optionbox 2020. 11. 21. 14:14

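I am looking for some large public datasets, in particular:

Large sample web server logs that have been anonymized.
Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at http://aws.amazon.com/publicdatasets/.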

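1. Large sample web server logs that have been anonymized.

The following will work:

UCI Machine Learning Repository

There are many more datasets you could use (see the range of other answers), but this is the lowest-hanging fruit that meets your original criteria. As a bonus, it has contact links in case you have specific needs that the maintainers may know about.

2. Datasets used for database performance benchmarking.

This sounds like a misnomer, because you are asking for an empirical dataset to describe a well-defined algorithmic problem. In particular, it sounds like you want a dataset for testing and benchmarking a variety of database systems in real time: a well-defined, normalized set of relational data that serves as a set of test cases for determining the most efficient solution for your needs.

I disagree with this approach. Rather than seeking out numerous database systems and pre-rolled implementations, you are much better off exploring the algorithmic guarantees of those systems as your first port of call. Once you have determined the algorithmic constraints that meet your requirements, you can home in on a set of pre-rolled solutions and benchmark them for the efficiency of, for instance, indexing, sorting, searching, insertion, deletion, and retrieval.

Wikipedia provides a concise article on database testing concepts that you can use to determine and create test cases for performance benchmarking. You could use an independent data access interface, such as JDBC and JDBC Benchmark, to determine the relative timing of each operation. From there, you can seek out the right solution.

In short, go to the research first to determine the database guarantees. Once a set of candidate solutions is identified, you can choose among them by testing (or determining) the constant-time performance of each of your desired operations.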
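As a rough illustration of measuring the relative timing of individual operations, here is a minimal sketch. It does not use JDBC as the answer suggests; it swaps in Python's standard `sqlite3` module with an in-memory database so that it runs without any setup. The table name `items` and the function `time_operations` are hypothetical, chosen only for this example, and wall-clock numbers from an in-memory database are indicative at best.

```python
import sqlite3
import time


def time_operations(n_rows=10_000):
    """Time a few basic operations against an in-memory SQLite database.

    Returns (timings, remaining): a dict mapping operation name to elapsed
    seconds, and the number of rows left after the delete step.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema used only for this sketch.
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
    rows = [(i, f"value-{i}") for i in range(n_rows)]
    timings = {}

    # Bulk insertion.
    start = time.perf_counter()
    conn.executemany("INSERT INTO items (id, value) VALUES (?, ?)", rows)
    conn.commit()
    timings["insert"] = time.perf_counter() - start

    # Point lookups through the primary key.
    start = time.perf_counter()
    for i in range(0, n_rows, 10):
        conn.execute("SELECT value FROM items WHERE id = ?", (i,)).fetchone()
    timings["lookup_by_key"] = time.perf_counter() - start

    # A search on a column that has no index, which forces a full scan.
    start = time.perf_counter()
    conn.execute("SELECT COUNT(*) FROM items WHERE value LIKE '%-99'").fetchone()
    timings["unindexed_scan"] = time.perf_counter() - start

    # Deletion of every row with an even id.
    start = time.perf_counter()
    conn.execute("DELETE FROM items WHERE id % 2 = 0")
    conn.commit()
    timings["delete"] = time.perf_counter() - start

    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return timings, remaining


if __name__ == "__main__":
    timings, remaining = time_operations()
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.4f}s")
    print(f"rows remaining: {remaining}")
```

Running the same function at several values of `n_rows` gives a crude picture of how each operation scales, which is the kind of comparison the answer recommends making only after the candidate systems have been narrowed down.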

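Based on Quora answers and my personal research collection, an awesome public datasets repository has been created and is actively updated on GitHub.

Below is a snapshot of the list. Visit GitHub for the latest version.

This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the datasets listed below are free, but some are not. The list is taken from https://github.com/caesar0301/awesome-public-datasets.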

Climate

Australian Weather: http://www.bom.gov.au/climate/dwo/
Climate Data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Global Climate Data Since 1929: http://www.tutiempo.net/en/Climate
NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
NOAA Climate Datasets: http://ncdc.noaa.gov/data-access/quick-links
WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html

Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
Internet Product Code Database: http://www.upcdatabase.com/
World Bank: http://data.worldbank.org/indicator

Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/
OSU Financial Data: http://fisher.osu.edu/fin/osudata.htm
Quandl: http://www.quandl.com/
St. Louis Fed: http://research.stlouisfed.org/fred2/
Yahoo Finance: http://finance.yahoo.com/

Biology

CRCNS: http://crcns.org/data-sets
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NIH Microarray Data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
Protein Structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu/
UniGene: http://www.ncbi.nlm.nih.gov/unigene

Physics

NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

Healthcare

EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
Gapminder: http://www.gapminder.org/data/
Medicare Data Files: http://go.cms.gov/19xxPN4

GeoSpace

EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
Factual Global Location Data: http://www.factual.com/
Geospatial Data: http://geodacenter.asu.edu/datalist/

Transportation

Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and Their Locations: http://www.infochimps.com/datasets/airports-and-their-locations
Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
Edge Data for US Domestic Flights, 1990 to 2009: http://data.memect.com/?p=229
Half a Million Hubway Rides: http://hubwaydatachallenge.org/trip-history-data/
NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
OpenFlights (airport, airline, and route data): http://openflights.org/data.html
RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
RITA Transport Data Collection: http://www.transtats.bts.gov/DataIndex.asp
Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
US Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm

Government

Archive-It: https://www.archive-it.org/explore?show=Collections
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
Chicago: https://data.cityofchicago.org/
FDA: https://open.fda.gov/index.html
FedStats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian World Governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London Datastore, UK: http://data.london.gov.uk/dataset
New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
NYC betanyc: http://betanyc.us/
NYC Open Data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Datasets: http://datasf.org/
The World Bank: http://wdronline.worldbank.org/
UK Government Data: http://data.gov.uk/data
US Census Bureau: http://www.census.gov/data.html
US Federal Government Agencies: http://www.data.gov/metric
US Federal Government Data Catalog: http://catalog.data.gov/dataset
US Open Government: http://www.data.gov/open-gov/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
United Nations: http://data.un.org/
US CDC Public Health Datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm

Data Challenges

Challenges in Machine Learning: http://www.chalearn.org/
ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
Kaggle Competition Data: http://www.kaggle.com/
KDD Cup by Tencent 2012: https://www.kddcup2012.org/
Netflix Prize: http://www.netflixprize.com/leaderboard
Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge

Machine Learning

eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
IMDb Database: http://www.imdb.com/interfaces
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
Machine Learning Data Set Repository: http://mldata.org/
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Datasets: http://datahub.io/dataset/movielens
RDataMining "R and Data Mining" ebook data: http://www.rdatamining.com/data
Registered Meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
SF Restaurant Dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Natural Language

40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
Flickr Personal Taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
Gutenberg eBook List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
Hansards: http://www.isi.edu/natural-language/download/hansard/
Machine Translation: http://statmt.org/wmt11/translation-task.html#download
SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
USENET Corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
WordNet: http://wordnet.princeton.edu/wordnet/download/

Image Processing

2GB of Photos of Cats: http://bit.do/UJZZ
Face Recognition Benchmark: http://www.face-rec.org/databases/
ImageNet: http://www.image-net.org/

Time Series

Time Series Data Library: https://datamarket.com/data/list/?q=provider:tsdl
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/

Social Sciences

Chinese Hotel Check-in/Check-out Data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
CMU Enron Email: http://www.cs.cmu.edu/~enron/
Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
Foursquare (2010, 2011): http://www.public.asu.edu/~hgao16/dataset.html
Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
General Social Survey (GSS): http://www3.norc.org/GSS+Website/
GetGlue (users rating TV shows): http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz
GitHub Archive: http://www.githubarchive.org/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
Pew Research Internet Project: http://www.pewinternet.org/datasets/pages/2/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
Universities Worldwide: http://univ.cc/
UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
YouTube Graph (2007, 2008): http://netsg.cs.sfu.ca/youtubedata/

Complex Networks

CrossRef DOI URLs: https://archive.org/details/doi-urls
DBLP Citation Dataset: https://kdl.cs.umass.edu/display/public/DBLP
NBER Patent Citations: http://nber.org/patents/
NIST Complex Networks Data Collection: http://math.nist.gov/~RPozo/complex_datasets.html
Protein-Protein Interaction Network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
Koblenz Network Collection: http://konect.uni-koblenz.de/
UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
UFL Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/
UNIMI Large Web Graphs: http://law.di.unimi.it/datasets.php
WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html

Computer Networks

3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
53.5B Web Clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
CAIDA Internet Datasets: http://www.caida.org/data/overview/
ClueWeb09: http://lemurproject.org/clueweb09/
ClueWeb12: http://lemurproject.org/clueweb12/
CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
Dartmouth CRAWDAD Wireless Datasets: http://crawdad.cs.dartmouth.edu/
OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
UCSD Network Telescope: http://www.caida.org/projects/network_telescope/

Data SEs

Academic Torrents: http://academictorrents.com/
Datahub.io: http://datahub.io/dataset
DataMarket: https://datamarket.com/data/list/?q=all
Harvard Dataverse: http://thedata.harvard.edu/dvn/
Statista: http://www.statista.com/
Freebase: http://www.freebase.com/

Public Domains

Amazon: http://aws.amazon.com/datasets
Archive.org Datasets: https://archive.org/details/datasets
CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
Numbrary: http://numbrary.com/
RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
StatSci.org: http://www.statsci.org/datasets.html
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php

Complementary Collections

DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
Inside-r: http://www.inside-r.org/howto/finding-data-internet
Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/

Here are several. Have fun.

http://archive.ics.uci.edu/ml/

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://books.google.com/ngrams/

http://linkeddata.org/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://reddit.com/r/datasets

http://timetric.com/public-data/

http://www2.jpl.nasa.gov/srtm

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://www.factual.com/

http://www.freebase.com/

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.imdb.com/interfaces

http://dbpedia.org

Just a thought:

USGS Geographic Names database
USDA PLANTS checklist
Any one of the many state GIS repositories e.g. NH's GRANIT

Well for the web server logs you could always just generate them for the format you need. If you are going to test code against it etc. it will have to be tailored to the fields you want to store/parse.
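One way to generate such logs is sketched below. It assumes the Apache Common Log Format as the target, which the answer does not specify, and the function name `generate_log_lines`, the URL paths, and the status-code mix are all hypothetical. Because the addresses are drawn from the private 10.0.0.0/8 range, no anonymization step is needed.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_log_lines(count, seed=0):
    """Return `count` synthetic access-log lines in Common Log Format."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    # Hypothetical request mix; adjust to the fields your parser expects.
    paths = ["/", "/index.html", "/about", "/api/items", "/static/site.css"]
    methods = ["GET", "GET", "GET", "POST"]
    statuses = [200, 200, 200, 301, 404, 500]
    current = datetime(2020, 1, 1, tzinfo=timezone.utc)
    lines = []
    for _ in range(count):
        # 10.0.0.0/8 is a private range, so no real client address is implied.
        ip = f"10.{rng.randint(0, 255)}.{rng.randint(0, 255)}.{rng.randint(1, 254)}"
        # Advance the clock by a few seconds so timestamps never go backwards.
        current += timedelta(seconds=rng.randint(0, 5))
        timestamp = current.strftime("%d/%b/%Y:%H:%M:%S %z")
        request = f"{rng.choice(methods)} {rng.choice(paths)} HTTP/1.1"
        status = rng.choice(statuses)
        size = rng.randint(200, 50_000)
        lines.append(f'{ip} - - [{timestamp}] "{request}" {status} {size}')
    return lines


if __name__ == "__main__":
    for line in generate_log_lines(5):
        print(line)
```

Tailoring the output to another format mostly means changing the final f-string, which is why the fields are built as separate variables.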

For the datasets used for database performance benchmarking, you'll probably want to look at a tool that can generate data for you. Red Gate has a great one for not too much money.
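If a commercial generator is not an option, a small script can produce relational test data as well. The sketch below is not the Red Gate tool mentioned above; it is a stand-in built on Python's standard `sqlite3` and `random` modules. The `customers` and `orders` tables and the function `build_test_database` are hypothetical and only illustrate the idea of generating normalized rows with a foreign-key relationship.

```python
import random
import sqlite3


def build_test_database(n_customers=100, orders_per_customer=5, seed=0):
    """Create an in-memory SQLite database filled with synthetic rows."""
    rng = random.Random(seed)  # seeded so that the data is reproducible
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-table schema with a foreign key from orders to customers.
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount_cents INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        ((i, f"customer-{i}") for i in range(1, n_customers + 1)),
    )
    # Every customer gets the same number of orders with random amounts.
    conn.executemany(
        "INSERT INTO orders (customer_id, amount_cents) VALUES (?, ?)",
        (
            (customer_id, rng.randint(100, 100_000))
            for customer_id in range(1, n_customers + 1)
            for _ in range(orders_per_customer)
        ),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_test_database()
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "orders generated")
    conn.close()
```

Scaling `n_customers` and `orders_per_customer` controls the data volume, so the same script can feed both quick smoke tests and larger benchmark runs.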

Google Fusion Tables has a few.

http://tables.googlelabs.com/

Datasets available here as well.

Kaggle.com frequently has data mining challenges. The datasets cover a wide range of fields, from healthcare provider data to credit history information. Perhaps something there is what you're after.

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug in or there are libraries in R, Python, Ruby, etc.

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

I am surprised no one mentioned Google N-Grams. More on N-Grams at http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Perhaps some databases used as training sets for face recognition algorithms: face-rec.org

Well, this one is new and there is a challenge behind it:

Million song dataset challenge

Source: https://stackoverflow.com/questions/381806/large-public-datasets
