Wednesday, February 1, 2017

Let the chaos begin

Don't worry, I won't talk about world politics - I am referring to the chaos that has started on Wikipedia regarding the geographic items in Thailand. Just recently, the Cebuano Wikipedia, one of the almost completely bot-filled Wikipedias, was flooded with new articles on geographic locations in Thailand, translated from the entries on the geonames website, which in turn imported a lot of its entries from the GEOnet Names Server (GNS). The big problem is that both databases contain a lot of bogus entries, especially when it comes to the lower levels of the administrative subdivisions. I bet no human will ever read these articles - why should someone from the Philippines speaking only this local language ever look for a Thai village? The real trouble is that these articles now need to be linked to the real world with Wikidata.

To give just a few examples of the mess:
  • The principal town of Phunphin district in Surat Thani province is named Tha Kham, but on geonames it was wrongly named Phunphin. And to make it worse, different spellings including the term "Amphoe" were listed as alternative names, so now the bot-created page mixes up things about the district and the town.
  • Geonames has an entry named "Ban Talat Yai" as a populated place in Phuket. However, there is no "Ban" with that name there, only a subdistrict. So the bot-created article is bogus, and linking it to the subdistrict in Wikidata would be wrong. Nearby Kathu is even worse - there's a municipality with that name (which would fit the category populated place, I suppose) as well as a subdistrict, but the geonames entry had "Tambon Kathu" as an alternative name, mixing both items.
  • Geonames has two entries for the Ta Phraya district, Sa Kaeo province (1949382 and 1605741), and sadly I am not able to delete the second one there due to insufficient user rights. And of course the bot has now created two articles about the same item - one and two.
Luckily, I had already added the geonames IDs of all the districts and provinces to Wikidata, so at least those bot-created articles can be matched automatically.
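
To illustrate what matching by geonames ID could look like, here is a minimal Python sketch. It is not the tool actually used for the matching: the item IDs and article titles are made-up placeholders, and only the two geonames IDs are taken from the Ta Phraya example above. It also shows how a duplicate geonames entry surfaces as two articles pointing to the same item.

```python
from collections import defaultdict

# GeoNames ID -> Wikidata item. The item ID is a placeholder, not a real QID.
GEONAMES_TO_ITEM = {
    "1949382": "Q-placeholder-ta-phraya",
    "1605741": "Q-placeholder-ta-phraya",  # duplicate GeoNames entry, same district
}


def match_articles(articles, geonames_to_item):
    """Group (title, geonames_id) pairs by Wikidata item.

    Returns (matched, unmatched): matched maps an item to the list of
    article titles whose GeoNames ID is known; unmatched lists the rest.
    """
    matched = defaultdict(list)
    unmatched = []
    for title, geonames_id in articles:
        item = geonames_to_item.get(geonames_id)
        if item is None:
            unmatched.append(title)
        else:
            matched[item].append(title)
    return dict(matched), unmatched


def find_duplicates(matched):
    """Return only the items that more than one article points to."""
    return {item: titles for item, titles in matched.items() if len(titles) > 1}


# Hypothetical bot-created articles; titles and the last ID are invented.
articles = [
    ("Article one", "1949382"),
    ("Article two", "1605741"),
    ("Article three", "unknown-id"),
]

matched, unmatched = match_articles(articles, GEONAMES_TO_ITEM)
duplicate_items = find_duplicates(matched)
```

Articles whose geonames ID is not yet recorded end up in the unmatched list, and those are the ones that still need to be checked by hand.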

And to make it worse, it was not only the populated places and subdivisions which were imported, but also all the hills, caves, lakes, rivers, etc. Cleaning up the mess for the towns alone would already keep me busy for weeks at least... The only small upside of having an article on each entity in one Wikipedia is that it becomes impossible to wrongly merge Wikidata items which are related but not the same.