Fixing errors in XML files

These are errors that make the XML parser fail (I use the XML package in R). Important: the same errors were already present in the previous version of the data, so they are most likely systematic problems or data-entry problems.
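Each error below comes with the list of files in which it occurred. One way such a list could be produced is to try parsing every file and keep the error message of those that fail. The following is a minimal sketch, not the procedure actually used here: `find_failing` and the `data` directory are hypothetical names, and the parser is passed in as an argument so that the helper itself does not depend on any package.

```r
# Hypothetical helper: returns a named character vector with one error
# message per file that parse_fun could not read.
find_failing <- function(files, parse_fun) {
  msgs <- vapply(files, function(f) {
    tryCatch({
      parse_fun(f)
      NA_character_                      # parsed without an error
    }, error = function(e) conditionMessage(e))
  }, character(1))
  msgs[!is.na(msgs)]                     # keep only the failures
}

# Possible usage (assumes the XML package is installed and the files are
# stored in a directory called "data"):
# files <- list.files("data", pattern = "\\.xml$", full.names = TRUE)
# find_failing(files, XML::xmlParse)
```

The names of the returned vector are the file names, so files can be grouped by message to see which fix each one needs.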

Error 1:

Description:
A recurring error in which the opening tag <SeriesInfoISBD> is missing. The error always follows “</SeriesInfo>******”, where the asterisks stand for the content of the corrupted record, so we simply delete the orphaned closing tag </SeriesInfoISBD> on the next line.
R code:
txt[grep("</SeriesInfo>....", txt) + 1] <- ""  # blank the line after each match (the orphaned </SeriesInfoISBD>)
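To illustrate what this fix does, here is a small sketch on a made-up three-line excerpt. The sample content is an assumption (the real records may look different), as is the layout in which the orphaned closing tag sits alone on the line after the match. The extra guard keeps the index from running past the end of the vector if the last line happens to match.

```r
# Made-up excerpt; the real records may look different.
txt <- c(
  "<SeriesInfo>Zbirka</SeriesInfo>C4-0568",
  "</SeriesInfoISBD>",
  "<Title>Primer</Title>"
)

hits <- grep("</SeriesInfo>....", txt)   # lines where text follows </SeriesInfo>
hits <- hits[hits < length(txt)]         # never index past the last line
txt[hits + 1] <- ""                      # blank the orphaned closing tag

print(txt)
```

Only the second element is emptied; the matching line itself and everything after the blanked line stay unchanged.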

Files:
“bib211_00249.xml” “bib211_00956.xml” “bib211_00962.xml” “bib211_04628.xml”
“bib211_09085.xml” “bib211_10720.xml” “bib211_00300.xml” “bib211_01927.xml”
“bib211_02895.xml” “bib211_03707.xml” “bib211_04121.xml” “bib211_08504.xml”
“bib211_09593.xml” “bib211_09975.xml” “bib211_15069.xml”

Error 2:

Description:
#xmlParseCharRef: invalid xmlChar value 0
A hexadecimal character reference ends with č instead of a semicolon.

R code:
txt <- gsub("(&#x....)č", "\\1;", txt)  # regular-expression substitution: replace the č with a semicolon ;
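A short sketch of this substitution on a made-up string. The sample text is an assumption, and `strict` is a hypothetical, tighter variant of the pattern that accepts only four hexadecimal digits, so that it cannot touch a č that merely happens to follow some other text beginning with `&#x`.

```r
# "\u010d" is the letter č, written as an escape so that the script does
# not depend on the encoding of the source file.
broken <- "Pre&#x0161\u010deren"   # made-up sample; intended: "Pre&#x0161;eren"

# Pattern as used above: "&#x", any four characters, then č.
fixed <- gsub("(&#x....)\u010d", "\\1;", broken)

# Hypothetical stricter variant: exactly four hexadecimal digits.
strict <- gsub("(&#x[0-9A-Fa-f]{4})\u010d", "\\1;", broken)

print(fixed)
print(strict)
```

On this sample both patterns give the same result; they differ only on input where the four characters after `&#x` are not all hexadecimal digits.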

Files:
“bib211_00582.xml” “bib211_01859.xml” “bib211_05958.xml” “bib211_07560.xml”
“bib211_10180.xml” “bib211_11517.xml” “bib211_19277.xml”

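Error 3:

Description:
#xmlParseCharRef: invalid xmlChar value 0
The same parser message as in error 2; here the malformed reference &#61566, (ending in a comma instead of a semicolon) is removed entirely.

R code:
txt <- gsub("&#61566,", "", txt)  # delete the malformed character reference

Files: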
“bib211_05008.xml” “bib211_05098.xml” “bib211_10412.xml” “bib211_07914.xml”
“bib211_18302.xml”

Error 4:

Description:
A line break inside a tag

R code:
txt <- paste(txt, sep = "", collapse = "\n")  # join all lines into a single string
txt <- gsub("\n", "", txt)                    # then remove every line break
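A minimal sketch of the effect, using a made-up two-line input in which a tag name is split across lines. It also shows that collapsing with an empty separator gives the same string as joining with "\n" and then deleting every "\n". The variable names are illustrative only.

```r
# Made-up input: the tag name "Title" is broken across two lines.
txt <- c("<Record><Tit", "le>Primer</Title></Record>")

# Two-step form: join with "\n", then delete every "\n".
two_step <- gsub("\n", "", paste(txt, sep = "", collapse = "\n"))

# Shorter equivalent: join with an empty separator.
one_step <- paste(txt, collapse = "")

print(one_step)
```

Note that this removes every line break in the file, not only the ones inside tags, so text that was separated only by a line break ends up joined without a space.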

Files:
“bib211_01119.xml” “bib211_07625.xml” “bib211_03370.xml”

Error 5:

Description:
#PCDATA invalid Char value 11

R code:
txt <- gsub("\v", "", txt)  # "\v" is the vertical tab, character code 11

Files:
“bib211_00736.xml” “bib211_00986.xml” “bib211_04423.xml” “bib211_07254.xml”
“bib211_08137.xml” “bib211_02565.xml” “bib211_15299.xml” “bib211_03480.xml”

Error 6:

Description:
Two invalid characters (they always occur as a pair): #PCDATA invalid Char value 15, #PCDATA invalid Char value 16

R code:
txt <- gsub("\017", "", txt)  # octal 017 = character code 15
txt <- gsub("\020", "", txt)  # octal 020 = character code 16

Files:
“bib211_05614.xml” “bib211_04001.xml” “bib211_10038.xml” “bib211_05320.xml” “bib211_11222.xml” “bib211_07469.xml”

Error 7:

Description:
Two invalid characters (they always occur as a pair): #PCDATA invalid Char value 17, #PCDATA invalid Char value 18

R code:
txt <- gsub("\021", "", txt)  # octal 021 = character code 17
txt <- gsub("\022", "", txt)  # octal 022 = character code 18

Files:
“bib211_04001.xml” “bib211_05320.xml” “bib211_07469.xml” “bib211_10038.xml” “bib211_11222.xml”
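Errors 5 to 7 all come down to deleting individual control characters, so the five substitutions could also be written as a single one. The sketch below is one possible way to do that; `strip_ctrl` is a hypothetical helper name, and the character class lists exactly the five codes reported above (11, 15, 16, 17, 18) in octal. XML 1.0 forbids every control character below code 32 except tab, line feed and carriage return, so other codes may need to be added if they turn up in the data.

```r
# Hypothetical helper: delete the control characters with codes
# 11 (\013), 15 (\017), 16 (\020), 17 (\021) and 18 (\022).
strip_ctrl <- function(x) gsub("[\013\017\020\021\022]", "", x)

# Made-up sample containing all five characters plus a legitimate tab.
sample_txt <- "a\013b\017c\020d\021e\022f\tg"
cleaned <- strip_ctrl(sample_txt)

print(cleaned)
```

The tab survives because it is not in the character class; only the five listed codes are removed.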

Error 8:

Three more special cases (again the “/SeriesInfo” problem, only in slightly different combinations)

R code:
txt <- gsub("</SeriesInfo>C4-0568", "</SeriesInfo>", txt)  # drop the stray text after the closing tag
txt[grep("</SeriesInfo>....", txt) + 1] <- ""              # blank the line after each remaining match

File:
“bib211_00936.xml”

R code:
txt[grep("<SeriesInfoISBD>SBK ; LVIII-31", txt) - 1] <- ""  # blank the line before the match

Files:
“bib211_05156.xml” “bib211_08487.xml”

 
