
Javascript Variable With Html Code Regex Email Matching

This Python script does not output the email address example@email.com for this case. This was my previous post. How can I use BeautifulSoup or Slimit on a site to output

Solution 1:

Here is a rather interesting (I think) approach.

Instead of parsing this javascript code - execute it!

Get the ptr value, load it via BeautifulSoup and get the href attribute value from the a tag. Example using V8 engine:

from bs4 import BeautifulSoup
from pyv8 import PyV8

data = """
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
"""

# name the parser explicitly so bs4 does not guess one and emit a warning
soup = BeautifulSoup(data, "html.parser")

# prepare the function to return a value and add a function call
js_code = soup.script.text.strip().replace('document.all.something.innerHTML = ptr;', 'return ptr;') + "; something()"

ctxt = PyV8.JSContext()
ctxt.enter()

# evaluate the rewritten script; the result is the HTML string held in ptr
soup = BeautifulSoup(ctxt.eval(str(js_code)), "html.parser")
print(soup.a['href'].split('mailto:')[1])

Prints:

example@email.com
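
One fragile spot in the snippet above is the exact-string replace(): it only works if the assignment is written character for character as shown. Below is a minimal sketch of a more tolerant rewrite using re.subn. The names make_returning and ASSIGN_RE are made up for this example (they are not part of PyV8 or BeautifulSoup), and it assumes the function contains a single assignment of the form document.all.<id>.innerHTML = <var>;

```python
import re

# Hypothetical helper, not part of the original answer.
# Matches "document.all.<id>.innerHTML = <var>;" with flexible whitespace.
ASSIGN_RE = re.compile(r"document\.all\.\w+\.innerHTML\s*=\s*(\w+)\s*;")

def make_returning(js_source, func_name):
    # Turn the innerHTML assignment into "return <var>;"
    rewritten, count = ASSIGN_RE.subn(r"return \1;", js_source)
    if count == 0:
        raise ValueError("no innerHTML assignment found")
    # Append a call so that evaluating the script yields the returned string
    return rewritten.strip() + "; " + func_name + "()"

js = """
function something()
{
var ptr;
ptr = "";
ptr += "<a href=mail";
ptr += "to:example@email.com>email</a>";
document.all.something.innerHTML   =  ptr ;
}
"""

js_code = make_returning(js, "something")
print(js_code)
```

The resulting js_code can be passed to ctxt.eval() in place of the string built with replace().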

Solution 2:

Your problem is that you can't find "mailto" in your text, because the first half, "mail", is not on the same line as the second half, "to". To solve your problem properly, you only have to know the value of ptr at the end of this program.

I know that this is a bad way to do it, but if you are sure that the structure is always like this:

soup = """
<script LANGUAGE="JavaScript"> function ...() 
{ var ptr; 
ptr = ""; 
ptr += "..."; 
ptr += "..."; 
ptr += "...";
document.all.something.innerHTML = ptr; 
}
</script> 
"""

You can use this:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(soup, "html.parser")

for script in soup.find_all('script'):
    # This matches everything between "{ var ptr;"
    # and "document"
    regex = r"\{ var ptr;(.*)document"
    code = re.search(regex, script.text, flags=re.DOTALL).groups()[0]
    # This is actually dangerous because anything
    # in the code will be executed here as Python, but if
    # it's like your example everything will
    # work fine and you can access the value of ptr
    exec(code)
    print(ptr)

Now you can use either BeautifulSoup or re to parse ptr. If you don't know how it's structured, you can use this:

    mail = re.search("<a href=mailto:(.*?)>", ptr).groups()[0]
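
If running exec on scraped text feels too risky, the same value of ptr can probably be rebuilt without executing anything. The following is a rough sketch using only re, under the assumption that every piece of ptr is a plain double-quoted literal on a "ptr = ..." or "ptr += ..." line with no escaped quotes inside; script_text stands in for script.text from the loop above.

```python
import re

# Stand-in for script.text; same shape as the example in the question
script_text = """
function something()
{ var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
"""

# Collect the quoted literal from each "ptr = ..." / "ptr += ..." statement.
# Assumption: no escaped quotes and no non-literal expressions on those lines.
pieces = re.findall(r'ptr\s*\+?=\s*"([^"]*)"\s*;', script_text)

# Joining the pieces puts "mail" and "to" back next to each other
ptr = "".join(pieces)

match = re.search(r"<a href=mailto:(.*?)>", ptr)
mail = match.group(1) if match else None
print(mail)
```

This only concatenates string literals, so nothing from the page is executed; it will miss the address if the script builds ptr in any other way.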
