=========================
Some interesting datasets
=========================

https://data.world/datasets/global-warming
https://dssg.uchicago.edu/projects/
http://kateto.net/2016/05/network-datasets/
https://catalog.data.gov/dataset

=====================
Azure workload traces
=====================

https://github.com/Azure/AzurePublicDataset

These are massive VM traces from Microsoft Azure with CPU usage timeline data.

========================
SNIA I/O workload traces
========================

http://iotta.snia.org/
Traces: http://iotta.snia.org/traces (SBU has access to these traces)

Some of the traces are described briefly at:
https://www.researchgate.net/profile/Bruce_Worthington/publication/224331805_Characterization_of_storage_workload_traces_from_production_Windows_Servers/links/54105cbe0cf2df04e75d4e83.pdf

These are storage workload traces that contain multiple fields such as bytes written and read, timestamp, etc. Each trace could be one whole project. The dataset is in raw format and will likely require some processing. The data contains several columns, only some of which are likely to be useful. Hypotheses and models could involve the distribution of time between requests, time for completion, size of data accessed, etc.

============================
WITS Internet traffic traces
============================

https://wand.net.nz/wits/catalogue.php

These are Internet traffic traces from various sources. They are typically large and will require some non-trivial data processing. Hypotheses and models could involve the distribution of time between requests, request rate, source/destination pairs, etc.
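
As a rough illustration of the "time between requests" hypothesis mentioned for the SNIA and WITS traces, the sketch below computes inter-arrival gaps from a list of request timestamps and estimates the rate of an exponential model (the maximum-likelihood estimate is 1 / mean gap). The function names and the sample timestamps are hypothetical; real traces must first be parsed into numeric timestamps, and the exponential model is only one candidate to test.

```python
from statistics import mean


def interarrival_times(timestamps):
    """Return the gaps between consecutive timestamps, after sorting them."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def exponential_rate_mle(gaps):
    """Estimate the rate of an exponential distribution as 1 / mean gap."""
    if not gaps:
        raise ValueError("need at least two timestamps to form a gap")
    average_gap = mean(gaps)
    if average_gap <= 0:
        raise ValueError("mean gap must be positive")
    return 1.0 / average_gap


# Hypothetical timestamps in seconds, not taken from any real trace.
sample_timestamps = [0.0, 0.5, 1.5, 2.0, 4.0]
gaps = interarrival_times(sample_timestamps)
rate = exponential_rate_mle(gaps)
print("gaps:", gaps)
print("estimated rate (requests per second):", rate)
```

A goodness-of-fit test against the fitted distribution would be the natural next step before accepting or rejecting the hypothesis.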
=====================
Google cluster traces
=====================

https://github.com/google/cluster-data
Trace: https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md

Some reference papers analyzing these traces:
http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/ISTC-CC-TR-12-101.pdf and
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-95.pdf

This is a huge trace from a cluster at Google containing information about jobs submitted to that cluster over a period of time. The dataset is large and will require some non-trivial processing. Hypotheses and models could involve the distribution of resource usage, resources requested, server characteristics, etc.

======================
Facebook Hadoop traces
======================

https://github.com/SWIMProjectUCB/SWIM/wiki/Workloads-repository

Similar to the Google traces, these are server traces for Hadoop jobs from a Facebook cluster. They contain information about job submission time and input/output data size.

=================
IaaS cloud traces
=================

https://www.cs.ucsb.edu/~rich/workload/ (README included)

These are anonymized traces from multiple production systems running the Eucalyptus private IaaS cloud. They contain data on start and stop times for VMs, their size, and their host placement.

=======================
Wikipedia access traces
=======================

http://www.wikibench.eu/?page_id=60

These traces contain information about requests for Wikipedia pages over a certain time period for a certain data hosting region. The dataset contains the timestamp of each request and some other information that may be useful for analysis.
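
For traces that record a timestamp per request, such as the Wikipedia access traces, a first look at request rate could bucket the timestamps into fixed windows and count requests per window. This is a minimal sketch under assumptions: the function name, the one-hour default window, and the sample values are hypothetical, and timestamps are assumed to be already parsed into seconds.

```python
from collections import Counter


def requests_per_bucket(timestamps, bucket_seconds=3600):
    """Count requests in each fixed-width time bucket.

    Returns a dict mapping bucket index (timestamp // bucket_seconds)
    to the number of requests that fall in that bucket.
    """
    counts = Counter(int(ts // bucket_seconds) for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical request timestamps in seconds, not taken from any real trace.
sample_requests = [10, 20, 3599, 3600, 7300, 7301, 7302]
hourly = requests_per_bucket(sample_requests)
for bucket, count in hourly.items():
    print(f"hour {bucket}: {count} requests")
```

Buckets with no requests are absent from the result, so a full time series would need the missing indices filled in with zero.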
=====================
Other dataset sources
=====================

NOAA datasets: https://www.ncdc.noaa.gov/cdo-web/ and https://www.ncdc.noaa.gov/homr/
Election datasets: https://catalog.data.gov/dataset?tags=elections
Some more sources: https://www.springboard.com/blog/free-public-data-sets-data-science-project/

In each of these cases, the dataset has potential for interesting analyses and hypotheses. However, it will be the responsibility of the groups to validate the availability of data and form appropriate hypotheses.