[Taxacom] FW: Wikipedia and Geonames. was: AW: ANN: RDF Book Mashup - Integrating Web 2.0 data sources like Amazon and Google into the Semantic Web
Donat Agosti
agosti at amnh.org
Sat Dec 2 04:37:52 CST 2006
Here is a new thread from SIMILE, a Semantic Web mailing list, which might interest
those involved in extracting geographic names.
Donat
-----Original Message-----
From: general-bounces at simile.mit.edu [mailto:general-bounces at simile.mit.edu]
On Behalf Of Chris Bizer
Sent: Saturday, December 02, 2006 10:36 AM
To: Richard Cyganiak; Richard Newman
Cc: semantic-web at w3.org; 'Damian Steer'; general at simile.mit.edu; 'Karl
Dubost'
Subject: Wikipedia and Geonames. was: AW: ANN: RDF Book Mashup -
Integrating Web 2.0 data sources like Amazon and Google into the Semantic Web
>>> I wish that wikipedia had a fully exportable database
>>> http://en.wikipedia.org/wiki/Lists_of_films
>>>
>>> For example, being able to export all data of this movie as RDF,
>>> maybe a templating issue at least for the box on the right.
>>> http://en.wikipedia.org/wiki/2046_%28film%29
>>
>> Should be an easy job for a SIMILE-like screen scraper.
>>
>> If you start scraping down from the Wikipedia film list, you should get
>> a fair amount of data.
Some further ideas along these lines: what about scraping information about
geographic places like countries and cities from Wikipedia and linking the
data to geonames (http://www.geonames.org/ontology/)?
Something like http://XXX/wikipedia/Embrun owl:sameAs
http://sws.geonames.org/3020251/
The Wikipedia articles about countries and cities all follow relatively
similar structures (for instance http://en.wikipedia.org/wiki/Berlin), so it
should be easy to scrape them. They already contain links to other places,
like the boroughs and localities of Berlin, which could easily be
transformed into RDF links.
Many places also have geo-coordinates which, together with the place name,
allow scrapers to create links to the corresponding geonames localities automatically.
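A minimal sketch of such a lookup (in Python; the geonames search web service
URL, its parameters including the username, and the 0.5 degree tolerance are my
own assumptions, not something discussed in this thread):

import json
import urllib.parse
import urllib.request

# Rough sketch: look up a place scraped from Wikipedia in the geonames search
# web service and, if the coordinates roughly agree, emit an owl:sameAs link
# of the kind above. Check the geonames web service documentation for the
# current interface before relying on this.
def sameas_link(name, lat, lng, username="demo"):
    query = urllib.parse.urlencode({"q": name, "maxRows": 1, "username": username})
    with urllib.request.urlopen("http://api.geonames.org/searchJSON?" + query) as r:
        result = json.load(r)
    hits = result.get("geonames", [])
    if not hits:
        return None
    hit = hits[0]
    # Plausibility check on the coordinates to avoid linking the wrong place.
    if abs(float(hit["lat"]) - lat) > 0.5 or abs(float(hit["lng"]) - lng) > 0.5:
        return None
    return ("<http://XXX/wikipedia/%s> owl:sameAs <http://sws.geonames.org/%s/> ."
            % (urllib.parse.quote(name), hit["geonameId"]))

print(sameas_link("Embrun", 44.57, 6.50))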
Wikipedia content is published under the GNU Free Documentation License, so
there aren't the licensing problems we have with the Google and Amazon data.
As most articles follow the same structure, one approach to implementing such
an information service could be to:
- Use a crawling/screen-scraping framework that fills a relational database
with the information from Wikipedia (a rough sketch of this step follows below the list).
- Use D2R Server (http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/)
to publish the database on the Web and to provide a SPARQL end-point for
querying.
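For the first step, a minimal sketch of what the scraper could do (in Python;
the Special:Export URL, the infobox-parsing heuristic and the SQLite table
layout are my own assumptions, not part of any of the tools mentioned):

import re
import sqlite3
import urllib.parse
import urllib.request

# Rough sketch: fetch the wikitext of an article via Special:Export and store
# the "| key = value" lines of its infobox template in a relational table.
def infobox_pairs(title):
    url = "https://en.wikipedia.org/wiki/Special:Export/" + urllib.parse.quote(title)
    req = urllib.request.Request(url, headers={"User-Agent": "wikipedia-scrape-sketch/0.1"})
    wikitext = urllib.request.urlopen(req).read().decode("utf-8")
    # Very rough heuristic: pick up "| key = value" lines from infobox templates.
    return re.findall(r"^\|\s*([\w ]+?)\s*=\s*(.+)$", wikitext, re.MULTILINE)

conn = sqlite3.connect("wikipedia_places.db")
conn.execute("CREATE TABLE IF NOT EXISTS infobox (article TEXT, key TEXT, value TEXT)")
for title in ["Berlin", "Embrun"]:
    for key, value in infobox_pairs(title):
        conn.execute("INSERT INTO infobox VALUES (?, ?, ?)", (title, key, value.strip()))
conn.commit()

Once such a table is filled, a D2R Server mapping over it would provide the
Web-accessible RDF view and the SPARQL end-point of the second step.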
I once read about some pretty sophisticated screen-scraping frameworks that
fill relational databases with data from websites, but I have forgotten the
exact links. Does anybody know of them?
Cheers,
Chris
----- Original Message -----
From: "Richard Cyganiak" <richard at cyganiak.de>
To: "Richard Newman" <r.newman at reading.ac.uk>
Cc: "Chris Bizer" <chris at bizer.de>; "'Karl Dubost'" <karl at w3.org>; "'Damian
Steer'" <damian.steer at hp.com>; <semantic-web at w3.org>
Sent: Friday, December 01, 2006 7:19 PM
Subject: Re: AW: ANN: RDF Book Mashup - Integrating Web 2.0 data sources
like Amazon and Google into the Semantic Web
>
> On 1 Dec 2006, at 18:27, Richard Newman wrote:
>> Systemone have Wikipedia dumped monthly into RDF:
>>
>> http://labs.systemone.at/wikipedia3
>>
>> A public SPARQL endpoint is on their roadmap, but it's only 47 million
>> triples, so you should be able to load it in a few minutes on your
>> machine and run queries locally.
>
> Unfortunately this only represents the hyperlink structure and basic
> article metadata in RDF. It does no scraping of data from info boxes or
> article content. Might be interesting for analyzing Wikipedia's link
> structure or social dynamics, but not for content extraction.
>
> Richard
>
>
>
>>
>> -R
>>
>>
>> On 1 Dec 2006, at 4:30 AM, Chris Bizer wrote:
>>
>>>> I wish that wikipedia had a fully exportable database
>>>> http://en.wikipedia.org/wiki/Lists_of_films
>>>>
>>>> For example, being able to export all data of this movie as RDF,
>>>> maybe a templating issue at least for the box on the right.
>>>> http://en.wikipedia.org/wiki/2046_%28film%29
>>>
>>> Should be an easy job for a SIMILE-like screen scraper.
>>>
>>> If you start scraping down from the Wikipedia film list, you should get
>>> a
>>> fair amount of data.
>>>
>>> To all the Semantic Wiki guys: Has anybody already done something like
>>> this?
>>> Are there SPARQL end-points/repositories for Wikipedia-scraped data?
>>
>>
>>
>
>
_______________________________________________
General mailing list
General at simile.mit.edu
http://simile.mit.edu/mailman/listinfo/general