Wikidata & amino acids
Yesterday I was working on annotating a data set that contained lots of amino acids. For this annotation I made use of the Wikidata database. The fun thing about this database is that it is very structured. You can, for example, run SPARQL queries on it (and even though I wasn't familiar with these before I started my PhD, I rather enjoy them now). Below is an example query, which returns all proteinogenic, coded L-amino acids (the active forms of the amino acids, which are built into proteins during translation).
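# P279 is "subclass of": keep items that are a subclass of all three classes
# (amino acid, L-amino acid and proteinogenic amino acid).
SELECT ?ID ?IDLabel
WHERE
{
  ?ID wdt:P279 wd:Q8066 .
  ?ID wdt:P279 wd:Q24301658 .
  ?ID wdt:P279 wd:Q3241589 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}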
This gave me 19 results (in 244 ms), and to my surprise there was a mistake in the results: D-isoleucine was labeled as an L-amino acid. So I went into Wikidata and fixed this issue (which was quite easy, since the query results contain clickable links). After my change, I reran the query above and what do you know? The number of results went down to 18 (so my change was picked up almost instantly by the Wikidata query service).
On to the second problem: the last time I looked into amino acids (during my bachelor's), there were 20 (or, according to some literature, 22) proteinogenic amino acids. Apparently the data set I looked at (which triggered the amino acid work for me) didn't cover all proteinogenic amino acids. So, to find out which ones were missing, I consulted Wikipedia, which has an outstanding list of amino acids! To make my comparison easier, I added sorting to the query (ORDER BY ?IDLabel) and what do you know....
Alanine and asparagine seemed to be missing, but were found further down the list.... Apparently ORDER BY takes capitalisation into account too....
So I first changed the names of the amino acids I was looking into (there is a way to work around this capitalisation issue with regex, but I will look into that later).
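One possible workaround for the capitalisation issue that doesn't require renaming anything is to sort on a lower-cased copy of the label. This is only a sketch and not the regex approach mentioned above: it assumes English labels are fetched directly through rdfs:label instead of the label service, and ?label is just a variable name picked for this example.

```sparql
# Sketch: case-insensitive sorting of the amino acid labels.
# Assumption: English labels only, read via rdfs:label.
SELECT ?ID ?label
WHERE
{
  ?ID wdt:P279 wd:Q8066 .
  ?ID wdt:P279 wd:Q24301658 .
  ?ID wdt:P279 wd:Q3241589 .
  ?ID rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
# LCASE lower-cases the label text, so capitals no longer affect the order.
ORDER BY LCASE(STR(?label))
```

Sorting on the lower-cased value leaves the labels in Wikidata untouched; only the order of the results changes.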
Now I can see that cysteine is missing from the list, as well as glycine (which makes sense: glycine only has a hydrogen atom as its side chain, so it is not chiral and has no D or L configuration). So, I looked at the Wikidata page of cysteine and added "subclass of" statements for amino acid, L-amino acid and proteinogenic amino acid. So, the proteinogenic amino acids with an L-configuration can now be retrieved from Wikidata with one simple query.
And if you want to know by how many different combinations of nucleobases (A, T, C and G) a specific amino acid can be encoded, you can use the query below. It resulted in 46 unique three-base codes.... still more work to be done, apparently!
# P702 is "encoded by": count the encoding items per amino acid.
SELECT ?ID ?IDLabel (COUNT(?CodedBy) AS ?count)
WHERE {
  ?ID wdt:P279 wd:Q8066 .
  ?ID wdt:P279 wd:Q24301658 .
  ?ID wdt:P279 wd:Q3241589 .
  ?ID wdt:P702 ?CodedBy .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?ID ?IDLabel
ORDER BY DESC(?count)
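To get a single overall number instead of a count per amino acid, the same pattern can be wrapped in a COUNT(DISTINCT ...). This is a minimal sketch: ?uniqueCodes is a name chosen for this example, and I have not checked that it reproduces the total mentioned above.

```sparql
# Sketch: total number of distinct "encoded by" (P702) values
# across all amino acids matched by the pattern.
SELECT (COUNT(DISTINCT ?CodedBy) AS ?uniqueCodes)
WHERE {
  ?ID wdt:P279 wd:Q8066 .
  ?ID wdt:P279 wd:Q24301658 .
  ?ID wdt:P279 wd:Q3241589 .
  ?ID wdt:P702 ?CodedBy .
}
```

DISTINCT matters here in case the same encoding item is linked from more than one amino acid; without it, such an item would be counted once per link.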