
kainjow

Moderator emeritus
Original poster
Jun 15, 2000
7,958
7
I've got a Perl script that simply parses out HTML from the standard input, and then outputs the result. However, for UTF-8/Unicode text (still not 100% clear on the difference between these encodings...), the output is all garbled. Anyone have any ideas?

Here's the Perl code:
Code:
#!/usr/bin/perl
$str = "";
while ($line=<STDIN>)
{
	$str .= $line;
}
$str =~ s/<script[^>]*>(.*?)<\/script>//gsi; # remove <script>
$str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi; # remove html tags, allowing ">" inside quoted attribute values
print $str;
 

SilentPanda

Moderator emeritus
Oct 8, 2002
9,992
31
The Bamboo Forest
What I would probably do is figure out how to detect if it's non-ASCII and then convert it to ASCII before running the regex on it. I'm not a big Perl guy but that's what I'd do in most any other language.

And the complimentary link that I will leave on your pillow.
 

kainjow

Moderator emeritus
Original poster
Jun 15, 2000
7,958
7
OK here's the problem. I'm now playing with PHP since it's a little easier to use for now, so I recreated the script and tested it without doing any regex replacing on the string, and when I output the string, it's still garbled. So I need to find a way of correctly reading UTF-8/UTF-16 text and writing it back out unchanged:
Code:
#!/usr/bin/php
<?php
$html = "";
while ($line = fgets(STDIN))
	$html .= $line;
print $html;
?>
I've attached a test file you can use to test it. The way I've been testing it is
Code:
cat testfile.txt | ./striphtml.php > output.txt
(I double-checked to make sure cat was working fine, and it outputs correctly.) So I have no clue now. :(
 

kainjow

Moderator emeritus
Original poster
Jun 15, 2000
7,958
7
OK I'm stupid ;) In my script, I had a blank line between the #! line and the <?php line above, and PHP was outputting it. The input was UTF-16 (2 bytes per char), so that single 1-byte \n shifted everything after it by one byte, and that was throwing everything off :eek:
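To make the misalignment concrete, here is a minimal Perl sketch (not the script from the thread) that prepends one stray newline byte to a UTF-16 string and decodes the result. The sample text "Hi" and the choice of big-endian UTF-16 are assumptions picked only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# "Hi" in UTF-16BE is four bytes: 00 48 00 69.
my $utf16 = encode('UTF-16BE', 'Hi');

# Prepend one stray newline byte, like the blank line emitted before <?php.
my $shifted = "\n" . $utf16;

# Decode the first four bytes of the shifted stream as two 16-bit units.
my $garbled = decode('UTF-16BE', substr($shifted, 0, 4));

# Print everything as hex so the terminal encoding does not matter.
printf "original bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf16;
printf "shifted bytes:  %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $shifted;
printf "decoded as:     %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $garbled;
```

Every 16-bit unit after the stray byte pairs the low byte of one character with the high byte of the next, so none of the original characters survive.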

So basically I need to figure out how to work with UTF-16 data. So far I haven't had any luck :( I need to figure out a way for Perl to automatically determine the encoding of the text...
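One possible way to determine the encoding is to look for a byte-order mark (BOM) at the start of the input. The sketch below is an illustration under assumptions, not a tested answer from the thread: `guess_encoding`, `strip_html`, and `strip_html_bytes` are hypothetical helper names, input without a BOM is assumed to be UTF-8, and BOM-less UTF-16 or UTF-32 input is not handled. It decodes to Perl characters, strips tags, and re-encodes in the original encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Guess the encoding from a leading BOM; returns (encoding name, BOM length).
# Assumption: no BOM means UTF-8. BOM-less UTF-16 is NOT detected.
sub guess_encoding {
    my ($bytes) = @_;
    return ('UTF-8',    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16BE', 2) if $bytes =~ /\A\xFE\xFF/;
    return ('UTF-16LE', 2) if $bytes =~ /\A\xFF\xFE/;
    return ('UTF-8',    0);
}

# Strip <script> blocks and tags from already-decoded text.
sub strip_html {
    my ($text) = @_;
    $text =~ s/<script[^>]*>.*?<\/script>//gsi;
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $text;
}

# Bytes in, bytes out: keep the original BOM and encoding.
sub strip_html_bytes {
    my ($bytes) = @_;
    my ($enc, $bom_len) = guess_encoding($bytes);
    my $bom  = substr($bytes, 0, $bom_len);
    my $text = decode($enc, substr($bytes, $bom_len));
    return $bom . encode($enc, strip_html($text));
}

# In-memory sample: UTF-16LE with a BOM.
# To filter STDIN instead: binmode(STDIN); my $in = do { local $/; <STDIN> };
my $sample = "\xFF\xFE"
    . encode('UTF-16LE', "<p>Caf\x{e9} <b>\x{65e5}\x{672c}</b></p>");
my $out = strip_html_bytes($sample);

my ($enc) = guess_encoding($out);
printf "encoding: %s, %d bytes in, %d bytes out\n",
    $enc, length $sample, length $out;
```

When writing the result to a file or pipe, calling `binmode(STDOUT)` first keeps Perl from altering the raw bytes on platforms that translate line endings.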
 

kainjow

Moderator emeritus
Original poster
Jun 15, 2000
7,958
7
Now I'm getting desperate. I've tried every variation I could find and think of with PHP and Perl to get this to work. Basically it doesn't work properly with UTF-16, but UTF-8 is fine.

Background: I'm calling this script (doesn't really matter if it's PHP or Perl) from a Cocoa app to work on some text. Perl does regular expressions the fastest, that's why I'm using it. But I found some Cocoa code (OgreKit) that works with UTF-16, but it's super slow.

So any more ideas? :(
 

superbovine

macrumors 68030
Nov 7, 2003
2,872
0
From Terminal, type "man iconv".

If you look in the cowzilla CVS, there is an example of converting UTF-8 to ISO-8859-1. You should be able to do the same with UTF-16 without any problems.

PHP:
<?php

//....from cowzilla

iconv_set_encoding("internal_encoding", $this->feed->charset);
iconv_set_encoding("output_encoding", $this->feed->charset);
iconv_set_encoding("input_encoding", $this->feed->charset);

//...

//what you need, or something close to it.

iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-16");
?>
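Since the original script is Perl, the same convert-first idea can be sketched there with the core Encode module's `from_to`, which transcodes a byte string in place. This is only a sketch under assumptions: the sample string is made up, the simple tag regex stands in for the full one, and decoding plain "UTF-16" relies on the input starting with a BOM to pick the byte order.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Made-up sample input: Encode's "UTF-16" writes a BOM followed by big-endian data.
my $data = encode('UTF-16', "<i>na\x{ef}ve</i>");

# Transcode in place to UTF-8, which the existing script already handles.
from_to($data, 'UTF-16', 'UTF-8');

# Byte-oriented stripping is safe in UTF-8: '<' and '>' are always single
# bytes and never occur inside a multi-byte sequence.
$data =~ s/<[^>]*>//g;

# Transcode back to UTF-16 (with BOM) for the caller.
from_to($data, 'UTF-8', 'UTF-16');

printf "%d bytes of UTF-16 output\n", length $data;
```

The trade-off against decoding to characters is that the regexes keep running on raw bytes, so character classes and case-insensitive matching only behave as expected for ASCII.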
 