Data from a Stone:

PDF Ghettos and Aid Transparency

problem

no one quite knows where aid ends up or what it's used for

I know

it sounds ridiculous.

but USAID gave us $25m to investigate

AidData's Goal:

who provides aid, to whom, where, when, for what?

what does this look like?

why I'm here today

we want to automate the extraction of entities

like organizations, dates, amounts of money, place names

from aid documents

and organize documents by topic

to reduce the number of documents we need to look at by hand
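
a minimal sketch of the easy half of entity extraction, assuming amounts are written with a dollar sign and dates are in ISO form. The patterns, the `extract_entities` name, and the sample sentence are all hypothetical and not AidData's pipeline. Organizations and place names need more than regular expressions.

```python
import re

# Hypothetical patterns: dollar amounts such as "$25m" or "$1,200,000",
# and ISO dates such as "2012-11-01". Real aid documents vary far more.
MONEY = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s?(?:million|billion|bn|m)\b)?",
    re.IGNORECASE,
)
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text):
    """Return the money amounts and ISO dates found in text."""
    return {
        "amounts": MONEY.findall(text),
        "dates": DATE.findall(text),
    }


# Made-up sentence, for illustration only.
sample = "Donor committed $25m on 2012-11-01 and $1,200,000 for roads."
print(extract_entities(sample))
```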

how we achieve our goal: geocoding and activity coding

geocoding

where? (pulling from >8m place names)
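
geocoding by name lookup can be sketched as below. This assumes a toy three-entry gazetteer with approximate coordinates; the real list has more than 8m names. `GAZETTEER` and `geocode` are hypothetical names, not AidData's tooling.

```python
import re

# Toy gazetteer: lowercase place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "kabul": (34.56, 69.21),
    "herat": (34.35, 62.20),
    "kandahar": (31.63, 65.74),
}


def geocode(text, gazetteer=GAZETTEER):
    """Return (name, lat, lon) for every gazetteer name found in text."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        coords = gazetteer.get(token.lower())
        if coords is not None:
            hits.append((token, *coords))
    return hits


print(geocode("Road repair between Kabul and Herat"))
```

exact-name matching like this cannot tell a place from a person or organization with the same name, so it will produce false positives on real documents.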

activity coding

what? (544 descriptive codes in use)

activity codes include . . .

21040.02: "harbor guidance systems"

31150.03: "Supply of fertilizers"

16010.10: "Social mitigation of HIV/AIDS"
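
a naive baseline for automatic activity coding, using only the three codes above: pick the code whose label shares the most words with the project description. The stopword list, `suggest_code`, and the sample descriptions are hypothetical. A real coder would have to cover all 544 codes and would likely need a trained classifier.

```python
import re

# The three activity codes quoted on the slides above.
ACTIVITY_CODES = {
    "21040.02": "harbor guidance systems",
    "31150.03": "Supply of fertilizers",
    "16010.10": "Social mitigation of HIV/AIDS",
}

# Hypothetical stopword list, kept tiny on purpose.
STOPWORDS = {"of", "the", "and", "for", "to", "in"}


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def suggest_code(description):
    """Return the code whose label overlaps most with description, or None."""
    words = tokens(description)
    best_code, best_score = None, 0
    for code, label in ACTIVITY_CODES.items():
        score = len(words & tokens(label))
        if score > best_score:
            best_code, best_score = code, score
    return best_code


print(suggest_code("Delivery of fertilizers to smallholder farmers"))
```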

AidData sources:

sources vary

OECD CRS

"Creditor Reporting System"
~2.3m records from 1979 to 2011

IATI

"International Aid Transparency Initiative"
donor-contributed records

AMP

"Aid Management Platform"
country-by-country databases of aid received

World Bank alone has 156 document types

problems with input aid documents

docs missing or withheld by many donors (e.g. Saudi Arabia, China)

problems with official published records of aid

double counting (huge)

DAC members (the official donors' club)

example inputs

the PDF Ghetto

DOC/DOCX

spreadsheets

auto geocoding attempts

Afghanistan?

aid project locations (with ~50% false positives)

for fun, compare to English Wikipedia!

want to contribute?

automate activity coding

activity coding challenge on GitHub

appendix

AidData.org

AidData China has user-contributed photos of Chinese projects in Africa

contact

adecatur@aiddata.org (me)
sstewart@aiddata.org