Filter paragraphs from an XML using PHP and HTMLPurifier, SimpleXMLElement or DOM
I'm trying to remove the social media buttons, leaving only the paragraphs, from the description field of the XML here (it's too big to post here).
Edit: since some of you couldn't access the XML, here follows part of one of the description tags:
<description> <!-- twitter https://twitter.com/about/resources/buttons#tweet --> <script> document.write('<a href="https://www.twitter.com/tst_oficial" class="twitter-follow-button" data-show-count="false" data-lang="pt">seguir</a>'); !function(d,s,id){var js,fjs=d.getelementsbytagname(s)[0];if(!d.getelementbyid(id)){js=d.createelement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentnode.insertbefore(js,fjs);}}(document,"script","twitter-wjs");</script> <!-- curtir site facebook (enviar) --> <iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php?href=https://www.facebook.com/tstjus&layout=button_count&show_faces=false&action=like&colorscheme=light&width=25&height=25&locale=pt_br" scrolling="no" frameborder="0" style="border:0px; margin-left:30px; overflow:hidden; width:120px; height:25px;vertical-align:bottom;" allowtransparency="true"></iframe> <!-- google plus +1--> <script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script> <g:plusone size="medium" href="https://plus.google.com/103151838647081346830" style="border-left:-200px"></g:plusone> </div> </br></br> <div class="modelo_noticia"> <div> <div style="float: left; width:47%; text-align:center; margin: 0 9px 0 0;"><a href="/image/journal/article?img_id=5733388&t=1377023456174" target="_blank" style="text-decoration:none; color:black;"><img src="/image/journal/article?img_id=5733388&t=1377023456174" style="margin: 0 5px; width:98%;"/><span style="font-style:italic;"></span> </a></div> <p> </p> <p style="text-align: justify;"> <span style="font-size:12px;">"a clt continua atual enq...a.</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">...or.</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">o min...do".</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">ca...as".</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">ao enc...izou.</span></p> <p style="text-align: 
justify;"> <span style="font-size:12px;">também parti...o.</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">ao a...ócio".</span></p> <p style="text-align: justify;"> <span style="font-size:12px;"><strong>debate: reforma na clt</strong></span></p> <p style="text-align: justify;"> <span style="font-size:12px;">o min...s.</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">ao...disse.</span></p> <p style="text-align: justify;"> <span style="font-size:12px;">o m...o o país". </span></p> <p style="text-align: justify;"> <span style="font-size:12px;">(fernanda loureiro)</span></p> </div> <div style="clear:both;"></div> </div> <div style="vertical-align:bottom !important"> <!-- facebook curtir --> <!-- <script src="http://connect.facebook.net/pt_br/all.js#xfbml=1"></script> <fb:like layout="button_count" show_faces="true" width="80"></fb:like>--> <iframe class="fb_ltr" src="http://www.facebook.com/plugins/like.php?href=http://www.tst.jus.br/noticias/-/asset_publisher/89dk/content/{rss=true}&layout=button_count&show_faces=false&action=like&colorscheme=light&width=25&height=25&locale=pt_br" scrolling="no" frameborder="0" style="border:none;border:0;margin-left:0; overflow:hidden; width:95px; height:25px;horizontal-align:left;vertical-align:bottom;" allowtransparency="true"></iframe> <!-- twittar --> <span style="margin-left:20px;"> <script type="text/javascript"> var endereco; endereco = window.location.href; document.write('<a href="http://twitter.com/share?url=' + endereco + '" class="twitter-share-button" data-text="presidente tst diz que trabalho precisa ser valorizado sem perda de competitividade" data-count="horizontal" data-via="tst_oficial">tweet</a>') </script><script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script> </span> <!-- ok facebook recomendar --> <!--<iframe id="f2ee48257c" name="f1f8d54994" frameborder="0" scrolling="no" style="border: none; overflow: hidden; height: 20px; width: 
200px;" title="like content on facebook." class="fb_ltr" src="http://www.facebook.com/plugins/like.php?api_key=228619377180035&locale=pt_br&sdk=joey&channel_url=http://www.facebook.com/tstjus?fref=ts&version=18%23cb%3df360a99c9c&origin=http://www.tst.jus.br/noticias&href=http://www.tst.jus.br/noticias%26relation%3dparent.parent&node_type=link&width=150&font=arial&layout=button_count&colorscheme=light&show_faces=false&send=true&extended_social_context=false&action=recommend" allowtransparency="true"></iframe>--> <iframe border="0" frameborder="0" scrolling="no" class="fb_ltr" id="f2ee48257c" name="f1f8d54994" style="border:none;margin-left:0; overflow:hidden; width:200px; height:25px;horizontal-align:left;vertical-align:bottom;" allowtransparency="true" title="enviar notícia no facebook" class="fb_ltr" src="http://www.facebook.com/plugins/like.php?api_key=228619377180035&locale=pt_br&sdk=joey&channel_url=http://www.tst.jus.br/noticias%3fversion%3d18%23cb%3df360a99c9c%26origin%3dhttp://www.tst.jus.br/noticias%26relation%3dparent.parent&href=http://www.tst.jus.br/noticias&node_type=link&width=150&font=arial&layout=button_count&colorscheme=light&show_faces=false&send=true&extended_social_context=false&action=recommend"></iframe> <!-- youtube --> <a href="http://www.youtube.com/tst" target="_blank"> <img src="http://www.tst.jus.br/image/image_gallery?uuid=49d1dfeb-fba6-48be-9984-c2ba7dac709e&groupid=10157&t=1359131490760" border="0" title="inscrição no canal youtube tst" alt="inscrição no canal youtube tst"></a> </div> </br> </description>
I've tried using regex to get the first paragraph ('#<p[^>]*>(.*)</p>#isu'). I've also tried SimpleXMLElement and DOM, but I keep getting errors (I don't know them well, but they seem to be the best way to do it), and HTMLPurifier with filters, which returns nothing relevant.
Here is how I did it in the end (following Puggan Se's suggestion):
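On the regex attempt: with the s modifier, a greedy (.*) runs from the first <p> to the last </p>, so all paragraphs end up in a single match. The sketch below contrasts it with a lazy (.*?) on a made-up two-paragraph string ($html here is an assumption for illustration, not the real feed). It is only a demonstration of the quantifier behaviour; regex remains fragile on markup like this.

```php
<?php
// Hypothetical sample standing in for part of a <description> body.
$html = '<p style="text-align: justify;"><span>first</span></p> '
      . '<p><span>second</span></p>';

// Greedy: (.*) extends to the last </p>, so both paragraphs land in one match.
preg_match_all('#<p[^>]*>(.*)</p>#isu', $html, $greedy);

// Lazy: (.*?) stops at the nearest </p>, giving one match per paragraph.
preg_match_all('#<p[^>]*>(.*?)</p>#isu', $html, $lazy);

// Drop the inner <span> tags, keeping only the text.
$paragraphs = array_map('strip_tags', $lazy[1]);

print_r($paragraphs);
```

Note that <p[^>]*> would also match tags such as <pre>, which is one reason a DOM parser is usually the safer choice here.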
    $i = 0;
    $feed = '<xml string>'; // the whole XML string here

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;        // drop whitespace-only text nodes
    $dom->loadXML($feed, LIBXML_PARSEHUGE);  // LIBXML_PARSEHUGE for long XMLs
    $dom->formatOutput = true;               // only affects saveXML() output

    $xml = new DOMXPath($dom);
    // the prefix registered here must match the one used in the queries (dc:date)
    $xml->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    // evaluates
    $source       = $xml->evaluate("//channel/title");
    $titles       = $xml->evaluate("//item/title");
    $links        = $xml->evaluate("//item/link");
    $dates        = $xml->evaluate("//item/dc:date");
    $descriptions = $xml->evaluate("//item/description");

    // echoing the channel's title
    if ($source->length > 0) {
        $source = $source->item(0)->nodeValue;
        echo $source . '<br /><br />';
    }

    // echoing the items
    foreach ($titles as $title) {
        echo "{$titles->item($i)->nodeValue}<br /><br />";
        echo "{$links->item($i)->nodeValue}<br /><br />";
        echo "{$dates->item($i)->nodeValue}<br /><br />";

        // filtering the <p><span> text from <description>
        $description = "{$descriptions->item($i)->nodeValue} ";
        // convert the description itself (the original passed an undefined $conteudo)
        $description = mb_convert_encoding($description, 'HTML-ENTITIES', 'UTF-8');

        $domtmp = new DOMDocument();
        $domtmp->loadHTML($description);
        $xmltmp = new DOMXPath($domtmp);
        $desc = $xmltmp->evaluate("//p/span");
        foreach ($desc as $node) {
            echo "<p>{$node->nodeValue}</p>";
        }
        $i++;
    }

Do you know how to improve it?
Thank you very much for your help!
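One possible improvement, offered as a sketch rather than a definitive answer: move the paragraph filtering into a small helper and silence the libxml warnings that this kind of markup triggers (stray </div>, </br>, <g:plusone>). The function name extractParagraphs and the $sample string are assumptions made for illustration. The '<?xml encoding="UTF-8">' prefix is a common workaround for making loadHTML() read the input as UTF-8; it is used here in place of mb_convert_encoding() with 'HTML-ENTITIES', which is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: returns the trimmed text of every non-empty <p>
 * found in an HTML fragment, ignoring scripts, iframes and other widgets.
 */
function extractParagraphs(string $html): array
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Prefix is a workaround so the fragment is parsed as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $paragraphs = [];
    foreach ($xpath->query('//p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {          // skip spacer paragraphs such as <p> </p>
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}

// Made-up fragment shaped like the <description> content above.
$sample = '<script>document.write("x");</script>'
    . '<iframe src="http://www.facebook.com/plugins/like.php"></iframe></div>'
    . '<div class="modelo_noticia"><p> </p>'
    . '<p style="text-align: justify;"><span>first paragraph</span></p>'
    . '<p><span><strong>debate</strong></span></p></div>';

print_r(extractParagraphs($sample));
```

Querying //p instead of //p/span keeps paragraphs whose text is not wrapped in a span; whether that is desirable depends on the feed.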