DataNamespace

Category:Archives

The problem

2010 Populations with multiracial identifiers
Group                            | 2010 Population | Percentage of Total Population
White                            | 22,953,374      | 61.6%
White, not Hispanic or Latino    | 15,763,625      | 42.3%
Hispanic or Latino (of any race) | 14,013,719      | 37.6%
Mexican                          | 11,423,146      | 30.6%
Salvadoran                       | 573,956         | 1.5%
Guatemalan                       | 332,737         | 0.8%
Puerto Rican                     | 189,945         | 0.5%
Spaniard                         | 142,194         | 0.3%
Nicaraguan                       | 100,790         | 0.2%

Source: Demographics of California

Traffic by calendar year
Year | Passengers | Aircraft Movements | Freight (tons) | Mail (tons)
1994 | 51,050,275 | 689,888            | 1,516,567      | 186,878
1995 | 53,909,223 | 732,639            | 1,567,248      | 193,747
1996 | 57,974,559 | 763,866            | 1,696,663      | 194,091
1997 | 60,142,588 | 781,492            | 1,852,487      | 212,410
1998 | 61,215,712 | 773,569            | 1,787,400      | 264,473
1999 | 64,279,571 | 779,150            | 1,884,526      | 253,695
2000 | 67,303,182 | 783,433            | 2,002,614      | 246,538
2001 | 61,606,204 | 738,433            | 1,779,065      | 162,629
2002 | 56,223,843 | 645,424            | 1,869,932      | 92,422
2003 | 54,982,838 | 622,378            | 1,924,883      | 97,193
2004 | 60,704,568 | 655,097            | 2,022,911      | 92,402
2005 | 61,489,398 | 650,629            | 2,048,817      | 88,371
2006 | 61,041,066 | 656,842            | 2,022,687      | 80,395
2007 | 62,438,583 | 680,954            | 2,010,820      | 66,707
2008 | 59,815,646 | 622,506            | 1,723,038      | 73,505
2009 | 56,520,843 | 544,833            | 1,599,782      | 64,073
2010 | 59,069,409 | 575,835            | 1,852,791      | 74,034
2011 | 61,862,052 | 603,912            | 1,789,204      | 80,442
2012 | 63,688,121 | 605,480            | 1,866,432      | 96,779

Source: Los_Angeles_International_Airport#Traffic_and_statistics

There are thousands of data tables buried inside the body of Wikipedia articles. These tables are generally:

  • hard to reference: how do I cite or refer to a table? If I am lucky, there's a fragment/id I can link to in an article, but in general data tables in Wikipedia are not objects that can be referenced the way an image can.
  • hard to discover: for the same reason, it's impossible to obtain a human-readable list of tabular datasets that are included in Wikipedia articles.
  • hard to maintain: we do not provide table-specific versioning, meaning that a change to a dataset (a new row, an existing value modified) is just a regular article revision.
  • impossible to reuse across articles or projects: the same dataset in two articles of the same project or in the same article in two different projects would need to be copied and maintained twice.
  • visualization-unfriendly: we use static images or SVGs for timelines and plots that could be easily generated from a tabular data source.
  • hard to style consistently: templates and various hacks are used to make tables behave in the context of an article.
  • a huge source of pain for VE/Parsoid: parsing HTML tables in general (not just data tables), along with the templates used to render them, is one of the biggest challenges for VisualEditor.

Conversely, we have tons of simple charts (such as those available on the Wikimedia reportcard) that cannot be easily reused or embedded in Wikipedia articles.

A proposal

One of many static barcharts used across Wikipedia

A dedicated namespace for tabular data (represented as delimiter-separated values or JSON) will offer several benefits:

  • revision control: individual datasets will become fully revision controlled and much easier to maintain.
  • citability: each dataset will have a canonical URI (project_id:namespace:page_id) that makes it uniquely identifiable both internally (in Wikimedia projects) and externally.
  • reusable: data tables, instead of living inside the body of an article, will be transcluded/embedded via Lua and become reusable across all Wikimedia projects.
  • visualization-ready: tabular data that can be easily embedded into an article will allow us to develop MediaWiki extensions or gadgets that toggle between a tabular view and a chart view, replacing the need for static images or vector graphics.
  • consistently styled: editors can focus on curating the data and selecting a subset of meaningful options for rendering it as a table, instead of bothering with presentation issues. VisualEditor will have one less problem to worry about.
  • metadata: any page associated with a dataset can be used to store metadata, or (even better) the metadata can be stored on Wikidata if the data table exists as an entity in Wikidata.
  • machine readable: a uniquely identifiable object in a dedicated namespace can be exposed and accessed programmatically via the MediaWiki API.
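As a rough illustration of the delimiter-separated representation and the project_id:namespace:page_id identifier mentioned above, here is a minimal Python sketch. The tab delimiter, the function names parse_dataset and canonical_id, and the example values "enwiki", "Data" and 12345 are assumptions made for illustration only; the proposal does not specify any of them. The three data rows are copied from the LAX traffic table.

```python
import csv
import io
import json

# Three rows copied from the LAX traffic table above.
# Storing them tab-separated with a one-row header is an assumption.
RAW = (
    "Year\tPassengers\tAircraft Movements\n"
    "2010\t59,069,409\t575,835\n"
    "2011\t61,862,052\t603,912\n"
    "2012\t63,688,121\t605,480\n"
)


def parse_dataset(text, delimiter="\t"):
    """Parse delimiter-separated text with a one-row header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for record in reader:
        # Strip thousands separators so every value becomes an integer.
        rows.append({key: int(value.replace(",", "")) for key, value in record.items()})
    return rows


def canonical_id(project_id, namespace, page_id):
    """Join the parts in the project_id:namespace:page_id form from the proposal."""
    return f"{project_id}:{namespace}:{page_id}"


rows = parse_dataset(RAW)
print(json.dumps(rows, indent=2))           # the same dataset as JSON
print(canonical_id("enwiki", "Data", 12345))  # hypothetical identifier
```

Once the rows are plain records like this, the same data could feed either a table renderer or a chart, which is the kind of reuse the list above has in mind.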

Scope

The (initial) scope of this proposal is limited to:

  • tabular data already existing in Wikipedia articles, not original datasets imported from external sources
  • datasets of a sufficiently small size to be editable and rendered on-wiki (see discussion 1,2)

What about Wikidata?

Most of these motivations are the same as those given in the rationale for Wikidata, but Wikidata is focused on structured/semantic data, i.e. data that's typically used to express statements like: "entity Q has property P with qualifier R according to source S". With the exception of tables that can be generated as queries against structured data, support for tabular data (i.e. data that can be represented as a bar chart or a time series) is not within the scope of Wikidata. (discussion)
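To make the contrast concrete, the following Python sketch puts a statement-shaped record next to a plain table. It is a deliberate simplification and not Wikidata's actual data model; the Statement class, the table_to_statements helper, and the placeholder entity and property labels are all hypothetical. The two data rows are taken from the LAX traffic table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One claim in the 'entity, property, qualifier, source' shape described above."""
    entity: str
    prop: str
    value: int
    qualifier: str
    source: str


# A two-row excerpt of the LAX passenger series, kept as a plain table.
header = ("Year", "Passengers")
table = [(2011, 61862052), (2012, 63688121)]


def table_to_statements(entity, prop, rows, source):
    """Flatten each table row into its own statement, using the year as the qualifier."""
    return [
        Statement(entity, prop, value, qualifier=f"year={year}", source=source)
        for year, value in rows
    ]


statements = table_to_statements(
    "LAX (placeholder entity)",
    "passengers (placeholder property)",
    table,
    "Los_Angeles_International_Airport#Traffic_and_statistics",
)
for statement in statements:
    print(statement)
```

Every cell of the table turns into a separate statement that repeats the entity, property and source, which suggests why a time series is more naturally stored and edited as a table.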

State of the art

  • We already have JSON namespaces on Meta, with dedicated ContentHandler settings, that serve various purposes, from hosting data models (e.g. Schema:Edit) to Wikipedia Zero settings (e.g. Zero:250-99).
  • The WMF Multimedia team and Commons community are advocating the use of Wikidata to store media metadata. The same approach could be used to store metadata of tabular datasets.
  • In the Brede Wiki, Finn Årup Nielsen uses ordinary namespace pages to store scientific data as comma-separated values with a one-row header, see, e.g., Example on CSV file. This data can then be transcluded on other pages on the wiki, see, e.g., example. The transclusion uses the 'tab' tag from Johan the Ghost's 'SimpleTable' extension, wrapped in a template, which produces a static table rendering (apart from the standard sortable style). The data from the CSV pages is read by an external script that performs meta-analysis on the data, see, e.g., meta-analysis example. This script also allows the CSV data to be exported in JSON format. The 'semantic' annotation of the column headers takes place in standard MediaWiki templates that are aware of the format of the external script API, see, e.g., metaanalysis csv template referenced from BiND metaanalysis section. This simple approach, which requires no modification of a standard MediaWiki installation beyond enabling the 'SimpleTable' extension, has been described in more detail in a few articles:
Category:Wikidata Category:Proposals