Wednesday, January 22, 2014

TambonBot on Wikidata

It took quite some time, but a few days ago my automatic Wikidata editing bot was approved, and it has already made 12,000 edits on the 2428 administrative entities which have a corresponding Wikidata item. So far I have done only the trivial things, those which don't require adding sub-statements to statements, or sources. The activities done so far were:
  • Normalize the item names so that they do not include the type, e.g. Bueng Kan Province became Bueng Kan, both for English and for German.
  • For Thai, however, the name always includes the type, thus Bueng Kan Province is labeled จังหวัดบึงกาฬ.
  • Give a description with the full hierarchy to avoid any potential ambiguities, e.g. "district in Bueng Kan province, Thailand". For German I haven't implemented it yet as the grammar makes it a little more complicated; for English and Thai it was simple string concatenation.
  • Every item now has a link to the country Thailand.
  • Every item is now linked to the one in which it is located, except for the TAO and Thesaban - I am not sure whether I should link the province, the district or every (partially) covered Tambon.
  • The type of the entity is also linked, for some reason twice, once as "instance of" and once as "type of administrative unit".
  • Those entities which have a corresponding boundary item on OpenStreetMap are now linked as well.
As both the parent unit and the type could have changed in the past, adding the historical values with the corresponding start and end dates is still an open task to be programmed.
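The label normalization and the description building from the list above can be sketched in a few lines. This is only an illustration and not the bot's actual code: the function names and the suffix list are assumptions, and only the English case is covered.

```python
from typing import List, Sequence

# Assumed English type suffixes; the list the bot really uses is not given here.
EN_TYPE_SUFFIXES = (" Province", " District", " Subdistrict")


def normalize_label(label: str, suffixes: Sequence[str] = EN_TYPE_SUFFIXES) -> str:
    """Strip a trailing type word, e.g. 'Bueng Kan Province' -> 'Bueng Kan'."""
    for suffix in suffixes:
        if label.endswith(suffix):
            return label[: -len(suffix)]
    return label


def build_description(entity_type: str, parents: List[str]) -> str:
    """Concatenate the hierarchy, nearest parent first, ending with the country."""
    return f"{entity_type} in {', '.join(parents + ['Thailand'])}"


print(normalize_label("Bueng Kan Province"))
print(build_description("district", ["Bueng Kan province"]))
```

The second call reproduces the description quoted in the list. A German version would need more than concatenation, which is why it is left out.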

A property to hold the geocode of the entity has been created by now as well - when I saw that it received the property id 1067 I realized I should have waited a bit longer to catch the number 1099, as this code relates to the TIS 1099 standard. The code to add these identifiers is nearly finished; it still needs some polishing to add the references to the corresponding source of the code - TIS 1099:2535, TIS 1099:2548 or the full code list from DOPA. While waiting for this property to be created, I finally wrote an article on Wikipedia about this Thai standard - copyediting or translation is welcome...

Also almost completed is the code to fill the list of subdivisions, in this case clearly leaving out the TAO and municipalities, as these are not really subordinate to any of the central administrative units. There are several other edits which are most easily done by a bot; I am collecting my ideas on the bot userpage. The item on Bueng Kan is kind of my test item: it now has the largest number of statements of all the Thai subdivisions, and already takes quite long to load in a web browser.
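Filling the subdivision lists amounts to grouping entities by their parent while skipping the local government types. The sketch below shows that idea only; the `Entity` record, the type names and the sample rows are made up for illustration and say nothing about how the bot stores its data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Entity(NamedTuple):
    name: str
    kind: str              # e.g. "Changwat", "Amphoe", "Tambon", "TAO", "Thesaban"
    parent: Optional[str]  # name of the containing unit, None at the top level


# Local government units are not subordinate to the central units, so skip them.
EXCLUDED_KINDS = {"TAO", "Thesaban"}


def subdivisions_by_parent(entities: Iterable[Entity]) -> Dict[str, List[str]]:
    """Map each parent unit to the names of its central-administration children."""
    children: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity.parent is None or entity.kind in EXCLUDED_KINDS:
            continue
        children[entity.parent].append(entity.name)
    return dict(children)


# Illustrative sample rows only.
sample = [
    Entity("Bueng Kan", "Changwat", None),
    Entity("Mueang Bueng Kan", "Amphoe", "Bueng Kan"),
    Entity("Seka", "Amphoe", "Bueng Kan"),
    Entity("Example municipality", "Thesaban", "Bueng Kan"),
]
print(subdivisions_by_parent(sample))
```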

I am still learning more about what Wikidata can do - discovering more properties which can be applied to the administrative units I work with, as well as discovering what it will be able to do in the future once development progresses - e.g. the data type for the population number is not yet available. But I also had my first negative experience there: as Phuket is the province with the smallest number of subdivisions, I added all those items which have no Wikipedia article yet as more-or-less blank items to be filled by my bot later (except the PAO, those are a really special thing). As the idea behind Wikidata is to be more than just a repository of data for Wikipedia articles, these items were perfectly fine to be added. But since they had neither a Wikipedia link nor a link from any other item, one admin thought them to be unused orphans and deleted them, without notifying me or asking whether they were correct or not. So I had to do the same work twice; the only positive thing is that now all of these items contain the list of neighboring items, to make sure none looks like an unused orphan anymore.

1 comment:

Andy said...

Yes, sure, they were all upgraded to full Amphoe. I guess you refer to the first page of the spreadsheet with the numbers of subdivisions for each province. I haven't updated that page for quite some time; you might notice that Bueng Kan is also missing there...