These are the errors that make the XML parser fail (I am using the XML package in R). Important: the same errors were already present in the previous version of the data, so they are most likely systematic problems or data-entry problems.
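One way the affected files could be located is to try parsing each file and keep the parser message for those that fail. The sketch below assumes the XML package is installed; find_parse_errors and the directory name in the usage comment are hypothetical and not part of the original workflow.

```r
# Sketch: collect the parser error message for every file that fails to parse.
# Assumes the XML package is installed; find_parse_errors() is a hypothetical helper.
find_parse_errors <- function(paths) {
  msgs <- vapply(paths, function(p) {
    tryCatch({
      XML::xmlParse(p)      # raises an R error on malformed XML
      NA_character_         # NA marks a file that parsed cleanly
    }, error = function(e) paste(conditionMessage(e), collapse = " "))
  }, character(1))
  msgs[!is.na(msgs)]        # keep only the files that failed, named by path
}

# Usage (hypothetical directory name):
# errs <- find_parse_errors(list.files("bib211", pattern = "\\.xml$", full.names = TRUE))
# names(errs)   # paths of the files that need fixing
```

Grouping the returned messages by their text would give a list of files per error type, similar to the lists below.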
Error 1:
Error description:
A recurring error in which the opening tag <SeriesInfoISBD> is missing. Because the error always follows “</SeriesInfo>******”, where the asterisks stand for the content of the corrupted record, we simply delete the closing tag </SeriesInfoISBD>.
R code:
txt[grep("</SeriesInfo>....", txt) + 1] <- ""  # blank the line that follows each match
Files:
“bib211_00249.xml” “bib211_00956.xml” “bib211_00962.xml” “bib211_04628.xml”
“bib211_09085.xml” “bib211_10720.xml” “bib211_00300.xml” “bib211_01927.xml”
“bib211_02895.xml” “bib211_03707.xml” “bib211_04121.xml” “bib211_08504.xml”
“bib211_09593.xml” “bib211_09975.xml” “bib211_15069.xml”
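To make the effect of this line-based fix concrete, here is a small sketch on invented data. It assumes txt holds the file as one element per line (as readLines would return it) and that the orphaned closing tag sits alone on the line after the damaged </SeriesInfo> line; the sample record is made up.

```r
# Invented sample: line 2 is the damaged line, line 3 holds the orphaned closing tag.
txt <- c(
  "<Record>",
  "<SeriesInfo>Zbirka</SeriesInfo>abcd",
  "</SeriesInfoISBD>",
  "</Record>"
)

# Lines where </SeriesInfo> is followed by at least four more characters.
hit <- grep("</SeriesInfo>....", txt)

# Blank the line after each hit; the damaged line itself is left unchanged.
txt[hit + 1] <- ""
print(txt)
```

Blanking the whole following line is only safe if nothing else shares that line with the closing tag, which is an assumption here.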
Error 2:
Error description:
#xmlParseCharRef: invalid xmlChar value 0
R code:
txt <- gsub("(&#x....)č", "\\1;", txt)  # regular-expression substitution: replace the č after the character reference with a semicolon
Files:
“bib211_00582.xml” “bib211_01859.xml” “bib211_05958.xml” “bib211_07560.xml”
“bib211_10180.xml” “bib211_11517.xml” “bib211_19277.xml”
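The substitution above can be tried on a made-up string. The sketch assumes the damaged reference is a hexadecimal character reference with exactly four characters after "&#x", as the pattern implies; the sample title is invented, and "\u010d" is just an escaped way of writing č. The stricter pattern is a hypothetical tightening, not the original code.

```r
# Invented sample: a character reference terminated by "č" instead of ";".
bad <- "<Title>Pe&#x010D\u010d</Title>"

# Original pattern: any four characters after "&#x", then č.
fixed <- gsub("(&#x....)\u010d", "\\1;", bad)

# Hypothetical stricter variant: accept only four hexadecimal digits.
fixed_strict <- gsub("(&#x[0-9A-Fa-f]{4})\u010d", "\\1;", bad)

print(fixed)
```

The stricter variant leaves a č untouched when the four preceding characters are not hexadecimal digits, which lowers the chance of altering ordinary text.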
Error 3:
Error description:
#xmlParseCharRef: invalid xmlChar value 0
R code:
txt <- gsub(",", "", txt)  # note: as written, this pattern matches every comma in txt
Files:
“bib211_05008.xml” “bib211_05098.xml” “bib211_10412.xml” “bib211_07914.xml”
“bib211_18302.xml”
Error 4:
Error description:
A line break inside a tag
R code:
txt <- paste(txt, collapse = "")  # join all lines into one string, removing the line breaks
Files:
“bib211_01119.xml” “bib211_07625.xml” “bib211_03370.xml”
Error 5:
Error description:
#PCDATA invalid Char value 11
R code:
txt <- gsub("\v", "", txt)  # \v is the vertical tab, character code 11
Files:
“bib211_00736.xml” “bib211_00986.xml” “bib211_04423.xml” “bib211_07254.xml”
“bib211_08137.xml” “bib211_02565.xml” “bib211_15299.xml” “bib211_03480.xml”
Error 6:
Error description:
Two invalid characters (they always occur as a pair): #PCDATA invalid Char value 15, #PCDATA invalid Char value 16
R code:
txt <- gsub("\017", "", txt)  # octal 017 = character 15
txt <- gsub("\020", "", txt)  # octal 020 = character 16
Files:
“bib211_05614.xml” “bib211_04001.xml” “bib211_10038.xml” “bib211_05320.xml” “bib211_11222.xml” “bib211_07469.xml”
Error 7:
Error description:
Two invalid characters (they always occur as a pair): #PCDATA invalid Char value 17, #PCDATA invalid Char value 18
R code:
txt <- gsub("\021", "", txt)  # octal 021 = character 17
txt <- gsub("\022", "", txt)  # octal 022 = character 18
Files:
“bib211_04001.xml” “bib211_05320.xml” “bib211_07469.xml” “bib211_10038.xml” “bib211_11222.xml”
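Errors 5 to 7 all concern control characters that XML 1.0 does not allow in a document. As an alternative to one gsub call per character, they could be removed in a single pass. This is a sketch, not the code used above; strip_xml_control_chars is a hypothetical helper name.

```r
# Sketch: remove every control character below 0x20 that XML 1.0 forbids.
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are legal and are kept.
strip_xml_control_chars <- function(txt) {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", "", txt, perl = TRUE)
}

# Invented sample containing characters 11, 15, 16, 17, 18 and a legal tab.
demo <- "a\vb\017c\020d\021e\022f\tg"
cleaned <- strip_xml_control_chars(demo)
print(cleaned)
```

The single pass also catches forbidden control characters that have not yet shown up in any file, which may or may not be desirable if each occurrence should be inspected first.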
Error 8:
Three more special cases (the “/SeriesInfo” problem again, only in slightly different combinations)
R code:
txt <- gsub("</SeriesInfo>C4-0568", "</SeriesInfo>", txt)  # strip the stray text after the closing tag
txt[grep("</SeriesInfo>....", txt) + 1] <- ""  # then apply the fix from Error 1
File:
“bib211_00936.xml”
R code:
txt[grep("<SeriesInfoISBD>SBK ; LVIII-31", txt, fixed = TRUE) - 1] <- ""  # blank the line before each match
Files:
“bib211_05156.xml” “bib211_08487.xml”
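Because each fix applies only to specific files, and the fix for Error 4 collapses txt into a single string, the order of application matters: the line-based fixes (Errors 1 and 8) have to run while txt still has one element per line. A sketch of how the fixes could be organised follows; fixes, plan and apply_fixes are hypothetical names, and only three fixes and three of the files listed above are shown.

```r
# Each fix takes a character vector and returns the repaired one.
fixes <- list(
  series_info = function(txt) { txt[grep("</SeriesInfo>....", txt) + 1] <- ""; txt },
  vtab        = function(txt) gsub("\v", "", txt),
  join_lines  = function(txt) paste(txt, collapse = "")
)

# Which fixes each file needs, in the order they should run (excerpt only).
plan <- list(
  "bib211_00249.xml" = c("series_info"),
  "bib211_00736.xml" = c("vtab"),
  "bib211_01119.xml" = c("join_lines")
)

# Apply the named fixes one after another.
apply_fixes <- function(txt, fix_names) {
  for (nm in fix_names) txt <- fixes[[nm]](txt)
  txt
}

# Invented sample: a vertical tab plus a record split across lines.
demo <- c("<a>", "<b>x</b>\v", "</a>")
result <- apply_fixes(demo, c("vtab", "join_lines"))
print(result)

# Usage on real files (not run here):
# for (f in names(plan)) txt <- apply_fixes(readLines(f), plan[[f]])
```

Keeping the per-file plan as data makes it easy to see which files received which repair when the next version of the data arrives.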